If you use tSNE and UMAP only for visualization of high-dimensional data, you have probably never thought about how much of the global structure they can preserve. Indeed, both tSNE and UMAP were designed to predominantly preserve local structure, that is, to group neighboring data points together, which provides a very informative visualization.
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/GyroidSculpture.jpg', width=1000)
However, if you want to take the next step beyond visualization and run clustering, you will have difficulty doing it on the original data due to the Curse of Dimensionality and the choice of a proper distance metric; clustering on tSNE- or UMAP-reduced dimensions can be more promising instead. Below, I use the Innate Lymphoid Cell (ILC) scRNAseq data from Björklund et al. and compare K-means clustering on 1) raw expression data, 2) 30 significant Principal Components (PCs), 3) the tSNE 2D representation, and 4) 30 UMAP components. As we can see, due to the Curse of Dimensionality and the non-linearity of the scRNAseq data, K-means clustering failed on the raw expression data, and the 30 PCs were apparently also too high-dimensional a space for K-means to succeed. In contrast, non-linear dimension reduction down to 2 dimensions for tSNE and 30 for UMAP resulted in almost perfect agreement between the clustering and the tSNE dimension reduction.
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/Kmeans_tSNE_SignPCs_RawExpr_UMAPComp.png', width=2000)
However, this is where you really want to make sure that the tSNE and UMAP components capture enough of the global structure of the original data, i.e. preserve distances between clusters of data points, in order to obtain a correct hierarchy between data points, or clusters of data points, that were distant in high dimensions. Although not perfectly correct, in layman's terms one can say that:
It is widely acknowledged that clustering on a 2D tSNE representation is not a good idea because the distances between clusters (global structure) are not guaranteed to be preserved; therefore, proximity of two clusters on a tSNE plot does not imply biological similarity between the two cell populations. Clustering on a 2D UMAP representation may be a better idea since UMAP preserves more of the global structure, although this still remains to be proven. What we definitely see from the plot above is that clustering on a number of UMAP components outperforms clustering on a number of PCs, since 30 UMAP components retain more variation in the data than 30 PCs. In contrast, tSNE cannot deliver more than 3 components due to technical limitations, so clustering on a number of tSNE components is basically impossible. So clustering is where you start seeing the difference between tSNE and UMAP.
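The clustering-on-embeddings workflow itself can be sketched on synthetic data (a minimal stand-in, not the scRNAseq data above; make_blobs and all parameter values are purely illustrative): reduce the dimensionality, run K-means on the reduced coordinates, and score the agreement with known labels via the Adjusted Rand Index.

```python
# Sketch of the cluster-then-score workflow on synthetic high-dimensional blobs
# (illustrative stand-in for the scRNAseq comparison in the figure above).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, labels = make_blobs(n_samples=500, n_features=50, centers=4, random_state=123)

# Cluster on the raw high-dimensional data...
ari_raw = adjusted_rand_score(labels, KMeans(n_clusters=4, random_state=123, n_init=10).fit_predict(X))

# ...and on a low-dimensional embedding (PCA here; tSNE / UMAP coordinates
# would be plugged in the same way).
X_emb = PCA(n_components=2).fit_transform(X)
ari_emb = adjusted_rand_score(labels, KMeans(n_clusters=4, random_state=123, n_init=10).fit_predict(X_emb))
print(ari_raw, ari_emb)
```

On these well-separated Gaussian blobs both variants succeed; the point is only the workflow, since for real non-linear scRNAseq data the two scores diverge as shown above.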
Remember that the goal of dimension reduction is to transform the data from high- to low-dimensional space, i.e. to represent the cells / samples in low-dimensional space without losing too much information, preserving distances between both close and distant samples / cells. What exactly do we mean when we say that a dimension reduction algorithm is capable of preserving global structure? Both tSNE and UMAP define the probability of observing points at a certain distance via the following exponential family:
$$p_{ij}\approx \displaystyle e^{\displaystyle -\frac{(x_i-x_j)^2}{2\sigma_i^2}}$$Here $\sigma_i$ is a parameter responsible for how much cells / samples can "feel" each other. Since $\sigma_i$ is a finite value, i.e. does not go to infinity, every data point can "feel" the presence of only its closest neighbors and not the distant points; therefore, both tSNE and UMAP are neighbor graph algorithms and hence preserve local structure of the data. However, in the limit $\sigma_i \rightarrow \infty$ every point "remembers" every other point, so in this limit, in theory, both tSNE and UMAP can preserve global structure. However, it is not $\sigma_i$ that is the hyperparameter of tSNE and UMAP but the perplexity and the number of nearest neighbors n_neighbors, respectively. Let us check what perplexity and n_neighbors values lead to the limit $\sigma_i \rightarrow \infty$. For this purpose we will take one synthetic and one real-world scRNAseq data set and compute how the mean $\sigma$ depends on perplexity and n_neighbors.
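A minimal numeric illustration of this limit (toy distances, unnormalized probabilities): for finite $\sigma$ the affinity decays quickly with distance, while for very large $\sigma$ it approaches 1 for every pair, i.e. every point "feels" all the others.

```python
# Unnormalized Gaussian affinity p_ij = exp(-d^2 / (2 sigma^2)) on toy distances.
import numpy as np

d = np.array([0.1, 1.0, 10.0, 100.0])          # toy pairwise distances
p_finite = np.exp(-d**2 / (2 * 1.0**2))        # sigma = 1: distant pairs are invisible
p_infinite = np.exp(-d**2 / (2 * 1e4**2))      # sigma -> "infinity": all pairs ~ 1
print(np.round(p_finite, 4))
print(np.round(p_infinite, 4))
```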
To check how well PCA / MDS, tSNE and UMAP are capable of preserving global structure, let us construct a synthetic data set representing the world map with 5 continents: Eurasia, Africa, North America, South America and Australia. We sample a few thousand points from the areas of the continents and use them for testing the dimension reduction techniques.
import cartopy
import numpy as np
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import cartopy.feature as cfeature
from skimage.io import imread
import cartopy.io.shapereader as shpreader
shapename = 'admin_0_countries'
countries_shp = shpreader.natural_earth(resolution='110m',
category='cultural', name=shapename)
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
    #print(country.attributes['NAME_LONG'])
    if country.attributes['NAME_LONG'] in ['United States','Canada', 'Mexico']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('NorthAmerica.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
    if country.attributes['NAME_LONG'] in ['Brazil','Argentina', 'Peru', 'Uruguay', 'Venezuela',
                                           'Bolivia', 'Colombia', 'Ecuador', 'Paraguay']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('SouthAmerica.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
    if country.attributes['NAME_LONG'] in ['Australia']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('Australia.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
    if country.attributes['NAME_LONG'] in ['Russian Federation', 'China', 'India', 'Kazakhstan', 'Mongolia',
                                           'France', 'Germany', 'Spain', 'Ukraine', 'Turkey', 'Sweden',
                                           'Finland', 'Denmark', 'Greece', 'Poland', 'Belarus', 'Norway',
                                           'Italy', 'Iran', 'Pakistan', 'Afghanistan', 'Iraq', 'Bulgaria',
                                           'Romania', 'Turkmenistan', 'Uzbekistan', 'Austria', 'Ireland',
                                           'United Kingdom', 'Saudi Arabia', 'Hungary']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('Eurasia.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
    if country.attributes['NAME_LONG'] in ['Libya', 'Algeria', 'Niger', 'Morocco', 'Egypt', 'Sudan', 'Chad',
                                           'Democratic Republic of the Congo', 'Somalia', 'Kenya', 'Ethiopia',
                                           'The Gambia', 'Nigeria', 'Cameroon', 'Ghana', 'Guinea', 'Guinea-Bissau',
                                           'Liberia', 'Sierra Leone', 'Burkina Faso', 'Central African Republic',
                                           'Republic of the Congo', 'Gabon', 'Equatorial Guinea', 'Zambia',
                                           'Malawi', 'Mozambique', 'Angola', 'Burundi', 'South Africa',
                                           'South Sudan', 'Somaliland', 'Uganda', 'Rwanda', 'Zimbabwe', 'Tanzania',
                                           'Botswana', 'Namibia', 'Senegal', 'Mali', 'Mauritania', 'Benin']:
        ax.add_geometries(country.geometry, ccrs.Miller(),
                          label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('Africa.png')
plt.close()
rng = np.random.RandomState(123)
plt.figure(figsize=(20,15))
N_NorthAmerica = 10000
data_NorthAmerica = imread('NorthAmerica.png')[::-1, :, 0].T
X_NorthAmerica = rng.rand(4 * N_NorthAmerica, 2)
i, j = (X_NorthAmerica * data_NorthAmerica.shape).astype(int).T
X_NorthAmerica = X_NorthAmerica[data_NorthAmerica[i, j] < 1]
X_NorthAmerica = X_NorthAmerica[X_NorthAmerica[:, 1]<0.67]
y_NorthAmerica = np.array(['brown']*X_NorthAmerica.shape[0])
plt.scatter(X_NorthAmerica[:, 0], X_NorthAmerica[:, 1], c = 'brown', s = 50)
N_SouthAmerica = 10000
data_SouthAmerica = imread('SouthAmerica.png')[::-1, :, 0].T
X_SouthAmerica = rng.rand(4 * N_SouthAmerica, 2)
i, j = (X_SouthAmerica * data_SouthAmerica.shape).astype(int).T
X_SouthAmerica = X_SouthAmerica[data_SouthAmerica[i, j] < 1]
y_SouthAmerica = np.array(['red']*X_SouthAmerica.shape[0])
plt.scatter(X_SouthAmerica[:, 0], X_SouthAmerica[:, 1], c = 'red', s = 50)
N_Australia = 10000
data_Australia = imread('Australia.png')[::-1, :, 0].T
X_Australia = rng.rand(4 * N_Australia, 2)
i, j = (X_Australia * data_Australia.shape).astype(int).T
X_Australia = X_Australia[data_Australia[i, j] < 1]
y_Australia = np.array(['darkorange']*X_Australia.shape[0])
plt.scatter(X_Australia[:, 0], X_Australia[:, 1], c = 'darkorange', s = 50)
N_Eurasia = 10000
data_Eurasia = imread('Eurasia.png')[::-1, :, 0].T
X_Eurasia = rng.rand(4 * N_Eurasia, 2)
i, j = (X_Eurasia * data_Eurasia.shape).astype(int).T
X_Eurasia = X_Eurasia[data_Eurasia[i, j] < 1]
X_Eurasia = X_Eurasia[X_Eurasia[:, 0]>0.5]
X_Eurasia = X_Eurasia[X_Eurasia[:, 1]<0.67]
y_Eurasia = np.array(['blue']*X_Eurasia.shape[0])
plt.scatter(X_Eurasia[:, 0], X_Eurasia[:, 1], c = 'blue', s = 50)
N_Africa = 10000
data_Africa = imread('Africa.png')[::-1, :, 0].T
X_Africa = rng.rand(4 * N_Africa, 2)
i, j = (X_Africa * data_Africa.shape).astype(int).T
X_Africa = X_Africa[data_Africa[i, j] < 1]
y_Africa = np.array(['darkgreen']*X_Africa.shape[0])
plt.scatter(X_Africa[:, 0], X_Africa[:, 1], c = 'darkgreen', s = 50)
plt.show()
As a result, we have a collection of 2D data points belonging to 5 clusters / continents. So far this is planar, linear geometry, so linear dimension reduction techniques should be able to reconstruct the original data.
X = np.vstack((X_NorthAmerica, X_SouthAmerica, X_Australia, X_Eurasia, X_Africa))
y = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa))
print(X.shape)
print(y.shape)
We will start with the linear dimension reduction techniques PCA and MDS, which aim to preserve global pairwise distances.
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 2).fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c = y, s = 50)
plt.title('Principal Component Analysis (PCA)', fontsize = 20)
plt.xlabel("PCA1", fontsize = 20)
plt.ylabel("PCA2", fontsize = 20)
plt.show()
from sklearn.manifold import MDS
model_mds = MDS(n_components = 2, random_state = 123, metric = True)
mmds = model_mds.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(mmds[:, 0], mmds[:, 1], c = y, s = 50)
plt.title('Metric Multi-Dimensional Scaling (MDS)', fontsize = 20)
plt.xlabel("MDS1", fontsize = 20); plt.ylabel("MDS2", fontsize = 20)
plt.show()
We confirm that, up to linear transformations such as flips, shifts and rotations, the original data set is very well reconstructed by the PCA and MDS linear dimension reduction techniques. Let us now check how non-linear dimension reduction techniques such as tSNE and UMAP perform on the 2D linear data. We deliberately select large perplexity and n_neighbors hyperparameters, which should result in $\sigma_i \rightarrow \infty$ and therefore better preservation of the global structure.
from sklearn.manifold import TSNE
X_reduced = PCA(n_components = 2).fit_transform(X)
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 500,
init = X_reduced, n_iter = 1000, verbose = 2)
tsne = model.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE', fontsize = 20); plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
import warnings
warnings.filterwarnings("ignore")
from umap import UMAP
X_reduced = PCA(n_components = 2).fit_transform(X)
model = UMAP(learning_rate = 1, n_components = 2, min_dist = 1, n_neighbors = 500,
init = X_reduced, n_epochs = 1000, verbose = 2)
umap = model.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = y, s = 50)
plt.title('UMAP', fontsize = 20); plt.xlabel("UMAP1", fontsize = 20); plt.ylabel("UMAP2", fontsize = 20)
plt.show()
The quality of the visualizations is comparable between tSNE and UMAP in the sense that all 5 clusters / continents are well distinguishable. However, we can see that the original shapes of the continents are preserved a bit better by UMAP. In addition, South America seems to be placed between Africa and North America by tSNE, while it is correctly placed at the same longitude as North America by UMAP.
Previously, we had a collection of 2D data points on a linear planar surface. Let us now embed the 2D data points into a non-linear 3D manifold, the Swiss roll. The Swiss roll is a kind of Archimedean spiral in 3D space.
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/SwissRoll.png', width=2000)
When we project our 2D world map onto the 3D non-linear manifold, the intrinsic dimensionality of the data is still two, even though the points are embedded into 3D space. Therefore, linear dimension reduction techniques usually fail to reconstruct data on a non-linear manifold, as they try to preserve distances between all pairs of points, including ones that are not neighbors on the manifold. Let's project the world map onto the Swiss roll.
z_3d = X[:, 1]
x_3d = X[:, 0] * np.cos(X[:, 0]*10)
y_3d = X[:, 0] * np.sin(X[:, 0]*10)
X_swiss_roll = np.array([x_3d, y_3d, z_3d]).T
X_swiss_roll.shape
from mpl_toolkits import mplot3d
plt.figure(figsize=(20,15))
ax = plt.axes(projection = '3d')
ax.scatter3D(X_swiss_roll[:, 0], X_swiss_roll[:, 1], X_swiss_roll[:, 2], c = y)
plt.show()
First we will check how linear dimension reduction techniques such as PCA and MDS perform on the Swiss roll.
from sklearn.decomposition import PCA
X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
plt.figure(figsize=(20,15))
plt.scatter(X_swiss_roll_reduced[:, 0], X_swiss_roll_reduced[:, 1], c = y, s = 50)
plt.title('Principal Component Analysis (PCA)', fontsize = 20)
plt.xlabel("PCA1", fontsize = 20); plt.ylabel("PCA2", fontsize = 20)
plt.show()
from sklearn.manifold import MDS
model_mds = MDS(n_components = 2, random_state = 123, metric = True)
mds = model_mds.fit_transform(X_swiss_roll)
plt.figure(figsize=(20,15))
plt.scatter(mds[:, 0], mds[:, 1], c = y, s = 50)
plt.title('Metric Multi-Dimensional Scaling (MDS)', fontsize = 20)
plt.xlabel("MDS1", fontsize = 20); plt.ylabel("MDS2", fontsize = 20)
plt.show()
As expected, both PCA and MDS fail to reconstruct the original data, since they try to preserve global distances, while for the Swiss roll it is more important to preserve the local neighborhood. Let us now see whether tSNE and UMAP can do better. Note that for both tSNE and UMAP we start with PCA as the initialization for the gradient descent algorithm.
from sklearn.manifold import TSNE
X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 50,
init = X_swiss_roll_reduced, n_iter = 1000, verbose = 2)
tsne = model.fit_transform(X_swiss_roll)
plt.figure(figsize=(20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE', fontsize = 20); plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
import warnings
warnings.filterwarnings("ignore")
from umap import UMAP
X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
model = UMAP(learning_rate = 1, n_components = 2, min_dist = 1, n_neighbors = 500,
init = X_swiss_roll_reduced, n_epochs = 1000, verbose = 2)
umap = model.fit_transform(X_swiss_roll)
plt.figure(figsize=(20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = y, s = 50)
plt.title('UMAP', fontsize = 20); plt.xlabel("UMAP1", fontsize = 20); plt.ylabel("UMAP2", fontsize = 20)
plt.show()
A very obvious flaw of tSNE that we immediately see is that if one follows the line from South America towards Africa, one passes Eurasia, which was for some reason placed by tSNE between South America and Africa. In contrast, UMAP correctly places Africa between South America and Eurasia.
Obviously, both tSNE and UMAP reconstructed the original world map better than PCA and MDS. This is because linear methods such as PCA and MDS take the full affinity matrix as input and try to preserve distances between all pairs of points, while non-linear neighbor graph methods such as tSNE / UMAP and Locally Linear Embedding (LLE) take a sparse affinity matrix (KNN graph) as input and preserve only distances between nearest neighbors.
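The input difference can be made concrete with a tiny sketch (toy data; k = 10 is illustrative): MDS-style methods consume the full n × n distance matrix, while neighbor graph methods effectively consume a sparse kNN graph.

```python
# Full distance matrix (MDS-style input) vs sparse kNN graph (tSNE/UMAP/LLE-style input).
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import kneighbors_graph

X = np.random.RandomState(123).rand(200, 2)      # 200 toy points in 2D

full = euclidean_distances(X)                    # dense: n*n = 40000 stored distances
knn = kneighbors_graph(X, n_neighbors=10, mode='distance')  # sparse: n*k = 2000 entries
print(full.size, knn.nnz)
```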
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/MDS_vs_LLE.png', width=2000)
The quality of the tSNE and UMAP visualizations is comparable, although we used quite different hyperparameters to reach this similarity in the outcome. For tSNE we used the default learning rate 200 and a perplexity of 50 (default 30), while for UMAP we used the default small learning rate 1 and a large number of nearest neighbors, n_neighbors = 500 (default 15). These are very important hyperparameters, as they determine the contributions from the initialization and the cost function to the final embeddings. From coding tSNE and UMAP from scratch it becomes clear that there are two major contributions to global structure preservation when we run the gradient descent algorithm, updating the embeddings according to
$$y_i = y_i -\mu \frac{\partial \rm{Cost}}{\partial y_i}$$We see that the final embedding will depend on:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/GradDesc.png', width=2000)
We are going to compare the two algorithms for their global structure preservation on the world map synthetic data set. People with a biological background tend to ignore this approach and prefer to jump immediately to real-world noisy biological data. However, such simple data sets allow us to learn something fundamental about the algorithms, so here we will use more of the good old physics approach.
When we were running PCA / MDS and tSNE / UMAP on the 2D linear and 3D non-linear data, we came to opposite conclusions: the linear methods won on the linear data, while the neighbor graph methods won on the non-linear manifold.
Running tSNE and UMAP, we started with PCA as initialization. Therefore, any difference in the final output can only be explained by the cost function term $\displaystyle \mu \frac{\partial \rm{Cost}}{\partial y_i}$. Let us compute how this term behaves. For this purpose we need to understand the dependence between the perplexity and the $\sigma$ parameter in the denominator of the exponent of the probability of observing points at a certain distance
$$p_{ij}\approx \displaystyle e^{\displaystyle -\frac{(x_i-x_j)^2}{2\sigma_i^2}}$$Here we are going to compute the function $\sigma(\rm{Perplexity})$ for a real scRNAseq data set, the Cancer Associated Fibroblasts (CAFs) data set from Bartoschek et al.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
import warnings
warnings.filterwarnings("ignore")
#X_train = X;
path = '/home/nikolay/WABI/K_Pietras/Manifold_Learning/'
expr = pd.read_csv(path + 'bartoschek_filtered_expr_rpkm.txt', sep='\t')
print(expr.iloc[0:4,0:4])
X_train = expr.values[:,0:(expr.shape[1]-1)]
X_train = np.log(X_train + 1)
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the data set: ")
print(X_train.shape, y_train.shape)
dist = np.square(euclidean_distances(X_train, X_train))
plt.figure(figsize=(20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("HISTOGRAM OF EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
def prob_high_dim(sigma, dist_row):
    exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
    exp_distance[dist_row] = 0
    prob_not_symmetr = exp_distance / np.sum(exp_distance)
    return prob_not_symmetr
def perplexity(prob):
    return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))
def sigma_binary_search(perp_of_sigma, fixed_perplexity):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if perp_of_sigma(approx_sigma) < fixed_perplexity:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma
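As a quick sanity check on these definitions (restated here so the snippet is self-contained), note that the perplexity of a uniform distribution over k neighbors is exactly k, which is why the binary search over $\sigma$ can be read as searching for an "effective number of neighbors".

```python
# Perplexity of a uniform distribution over k nonzero entries equals k.
import numpy as np

def perplexity(prob):
    return np.power(2, -np.sum([p * np.log2(p) for p in prob if p != 0]))

k = 30
uniform = np.zeros(100)
uniform[:k] = 1.0 / k        # uniform over k "neighbors", zero elsewhere
print(perplexity(uniform))   # 2^(log2 k) = k
```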
my_perp = []; my_sigma_tSNE = []
for PERPLEXITY in range(3, X_train.shape[0], 10):
    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, PERPLEXITY)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
    print("Perplexity = {0}, Mean Sigma = {1}".format(PERPLEXITY, np.mean(sigma_array)))
    my_perp.append(PERPLEXITY)
    my_sigma_tSNE.append(np.mean(sigma_array))
plt.figure(figsize=(20,15))
plt.plot(my_perp, my_sigma_tSNE, '-o')
plt.title("tSNE: Mean Sigma vs. Perplexity", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("tSNE: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
sns.distplot(sigma_array)
plt.title("tSNE: Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
Please note that the distribution of high-dimensional probabilities corresponding to the large perplexity of 713 is centered around a very small value, 0.0013, implying that even though the perplexity is large, the high-dimensional probability still does not approach one, and is not even close to one.
We can immediately see that at small perplexity the "memory" parameter $\sigma_i$ grows approximately linearly with perplexity. However, when the perplexity approaches the sample size N, the "memory" parameter $\sigma_i$ goes to infinity hyperbolically. We can approximate the behavior of $\sigma_i$ as a function of perplexity with the following simple asymptotic:
$$\sigma (\rm{Perp}) \approx \frac{\rm{Perp} / N}{1 - \rm{Perp} / N}$$Let us check how well this asymptotic describes the obtained function $\sigma (\rm{Perp})$:
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np
N = X_train.shape[0]
perp = np.array(my_perp)
sigma_exact = np.array(my_sigma_tSNE)
sigma = lambda perp, a, b, c: (c*(perp / N)**a) / (1 - (perp / N)**b)
p , _ = optimize.curve_fit(sigma, perp, sigma_exact)
print(p)
plt.figure(figsize=(20,15))
plt.plot(perp, sigma_exact, "o")
plt.plot(perp, sigma(perp, p[0], p[1], p[2]), c = "red")
plt.title("Non-Linear Least Square Fit", fontsize = 20)
plt.gca().legend(('Original', 'Fit'), fontsize = 20)
plt.xlabel("X", fontsize = 20); plt.ylabel("Y", fontsize = 20)
plt.show()
The fact that the parameter $\sigma$ can in principle go to infinity at large perplexity means in practice that the contribution to the gradient descent from the cost function disappears, and tSNE becomes heavily dominated by its initialization at large perplexities. Let us prove this from the functional form of the gradient of the tSNE cost function.
Both tSNE and UMAP start with an initialization (random, PCA or Laplacian Eigenmaps) and update the coordinates via the gradient descent algorithm. Here I will ignore normalization constants in the equations of probabilities:
$$y_i = y_i -\mu \frac{\partial KL}{\partial y_i}; \quad \frac{\partial KL}{\partial y_i} = 4\sum_j{(p_{ij}-q_{ij})(y_i-y_j)\frac{1}{1+(y_i-y_j)^2}}; \quad q_{ij}\approx \frac{1}{1+(y_i-y_j)^2}; \quad p_{ij}\approx \displaystyle e^{\displaystyle -\frac{(x_i-x_j)^2}{2\sigma_i^2}}$$In the limit $\sigma_i \rightarrow \infty$ the probability to observe points at a distance in high-dimensional space becomes $p_{ij} \rightarrow 1$. Therefore:
$$\frac{\partial KL}{\partial y_i} \approx 4\sum_j{\left(1-\frac{1}{1+(y_i-y_j)^2}\right)(y_i-y_j)\frac{1}{1+(y_i-y_j)^2}} = 4\sum_j{\frac{(y_i-y_j)^3}{(1+(y_i-y_j)^2)^2}}$$In the limit of close embedding points: $$y_i-y_j \rightarrow 0: \quad \frac{\partial KL}{\partial y_i} \approx 4\sum_j{(y_i-y_j)^3} \rightarrow 0$$ In the limit of distant embedding points: $$y_i-y_j \rightarrow \infty: \quad \frac{\partial KL}{\partial y_i} \approx 4\sum_j{\frac{1}{y_i-y_j}} \rightarrow 0$$
We conclude that the contribution to the gradient descent updating rule from the cost function, $\displaystyle\mu\frac{\partial KL}{\partial y_i}$, disappears, and therefore if one starts with PCA as the initialization step, one ends up with PCA at the end, as the initial positions of the points are not updated by the gradient descent.
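This can be checked numerically: in the $p_{ij}\rightarrow 1$ limit, the per-pair gradient term derived above, $g(d) = 4d^3/(1+d^2)^2$ with $d = y_i-y_j$, vanishes for both close and distant embedding points (the grid of $d$ values is illustrative).

```python
# Per-pair KL gradient term in the p_ij -> 1 limit: g(d) = 4 d^3 / (1 + d^2)^2.
def g(d):
    return 4 * d**3 / (1 + d**2)**2

for d in [1e-3, 1.0, 1e3]:
    print(d, g(d))
# the term peaks at intermediate d and vanishes at both extremes,
# so neither close nor distant points get updated
```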
Let us now compute the $\sigma(\rm{n\_neighbor})$ dependence for UMAP and compare it to tSNE:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
dist = np.square(euclidean_distances(X_train, X_train))
rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]
plt.figure(figsize=(20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("HISTOGRAM OF EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
def prob_high_dim(sigma, dist_row):
    d = dist[dist_row] - rho[dist_row]
    d[d < 0] = 0
    return np.exp(- d / sigma)
def k(prob):
    return np.power(2, np.sum(prob))
def sigma_binary_search(k_of_sigma, fixed_k):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if k_of_sigma(approx_sigma) < fixed_k:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_k - k_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma
my_n_neighbor = []; my_sigma_umap = []
for N_NEIGHBOR in range(3, X_train.shape[0], 10):
    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: k(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, N_NEIGHBOR)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
    print("N_neighbor = {0}, Mean Sigma = {1}".format(N_NEIGHBOR, np.mean(sigma_array)))
    my_n_neighbor.append(N_NEIGHBOR)
    my_sigma_umap.append(np.mean(sigma_array))
plt.figure(figsize=(20,15))
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.title("UMAP: Mean Sigma vs. N_neighbor", fontsize = 20)
plt.xlabel("N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("UMAP: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
sns.distplot(sigma_array)
plt.title("UMAP: Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
Please note that the high-dimensional probability values became much larger for UMAP, and a small peak around 1 is quite visible.
We can see that $\sigma$ depends on n_neighbors in a very different way for UMAP than for tSNE: it does not go to infinity nearly as fast. This implies that as Perplexity / n_neighbors approaches the sample size N, $\sigma$ remains finite for UMAP while it goes to infinity for tSNE. We can approximate the $\sigma(\rm{n\_neighbors})$ dependence with the following simple expression:
$$\sigma(\rm{n\_neighbor})\approx -\frac{1}{\displaystyle\ln\left(\frac{\log_2(\rm{n\_neighbor})}{N}\right)}$$
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np
N = X_train.shape[0]
n_neighbor = np.array(my_n_neighbor)
sigma_exact = np.array(my_sigma_umap)
sigma = lambda n_neighbor, a, b: - a / np.log(b * np.log2(n_neighbor) / N)
p , _ = optimize.curve_fit(sigma, n_neighbor, sigma_exact)
print(p)
plt.figure(figsize=(20,15))
plt.plot(n_neighbor, sigma_exact, "o")
plt.plot(n_neighbor, sigma(n_neighbor, p[0], p[1]), c = "red")
plt.title("Non-Linear Least Square Fit", fontsize = 20)
plt.gca().legend(('Original', 'Fit'), fontsize = 20)
plt.xlabel("X", fontsize = 20); plt.ylabel("Y", fontsize = 20)
plt.show()
Let us compare how the Perplexity and n_neighbors hyperparameters behave for tSNE and UMAP, respectively, on the same data set, where the Euclidean distances are fixed. For this purpose we need to realize that the computed sigma for UMAP is not directly comparable with the computed sigma for tSNE, since the denominator of the exponent in the equation for high-dimensional probabilities is $2\sigma^2$ for tSNE but just $\sigma$ for UMAP. Therefore, we need to square all the obtained tSNE sigmas and multiply them by two.
my_sigma_tSNE_mod = [2*(i**2) for i in my_sigma_tSNE]
plt.figure(figsize=(20, 15))
plt.plot(my_perp, my_sigma_tSNE_mod, '-o')
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.gca().legend(('tSNE','UMAP'), fontsize = 20)
plt.title("Sigma vs. Perplexity / N_Neighbors for tSNE / UMAP", fontsize = 20)
plt.xlabel("PERPLEXITY / N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
UMAP's mean sigma does not change much with n_neighbors and reaches much smaller values than tSNE's mean sigma, so it is really hard to compare them, as they are not on the same scale. Since a log transform did not improve the view, let us restrict the x-axis in order to enlarge and resolve the behavior of UMAP's mean sigma.
plt.figure(figsize=(20, 15))
plt.plot(my_perp, my_sigma_tSNE_mod, '-o')
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.gca().legend(('tSNE','UMAP'), fontsize = 20)
plt.title("Sigma vs. Perplexity / N_Neighbors for tSNE / UMAP", fontsize = 20)
plt.xlabel("PERPLEXITY / N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.ylim(0,600); plt.xlim(0,500)
plt.show()
Again, we can see that for UMAP the dependence $\sigma(\rm{n\_neighbors})$ is logarithmic and therefore very slow, while for tSNE $\sigma(\rm{Perplexity})$ very quickly goes to infinity when the perplexity approaches the sample size N. Therefore, tSNE is much more sensitive to its Perplexity hyperparameter than UMAP is to its n_neighbors hyperparameter.
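The contrast can be sketched directly from the two fitted asymptotics above (taking all fit coefficients equal to 1 and an arbitrary sample size N = 1000, both purely illustrative): the tSNE curve diverges as the perplexity approaches N, while the UMAP curve barely moves.

```python
# Closed-form asymptotics: sigma_tSNE(Perp) = (Perp/N) / (1 - Perp/N)
# vs sigma_UMAP(k) = -1 / ln(log2(k) / N), evaluated on the same grid.
import numpy as np

N = 1000
k = np.array([10, 100, 500, 990])
sigma_tsne = (k / N) / (1 - k / N)             # diverges as k -> N
sigma_umap = -1 / np.log(np.log2(k) / N)       # grows only logarithmically slowly
print(np.round(sigma_tsne, 3))
print(np.round(sigma_umap, 3))
```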
To be on the safe side with our conclusion that mean sigma vs. n_neighbors goes to infinity much more slowly for UMAP than mean sigma vs. perplexity does for tSNE, which was obtained for the Cancer Associated Fibroblasts (CAFs) data set, we will recompute these curves for the synthetic World Map data set. We will do it first for the 2D (linear manifold) data set and then repeat it for the 3D Swiss-roll-embedded data set (non-linear manifold).
import cartopy
import numpy as np
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import cartopy.feature as cfeature
from skimage.io import imread
import cartopy.io.shapereader as shpreader
shapename = 'admin_0_countries'
countries_shp = shpreader.natural_earth(resolution='110m',
category='cultural', name=shapename)
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
#print(country.attributes['NAME_LONG'])
if country.attributes['NAME_LONG'] in ['United States','Canada', 'Mexico']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('NorthAmerica.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
if country.attributes['NAME_LONG'] in ['Brazil','Argentina', 'Peru', 'Uruguay', 'Venezuela',
'Bolivia', 'Colombia', 'Ecuador', 'Paraguay']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('SouthAmerica.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
if country.attributes['NAME_LONG'] in ['Australia']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('Australia.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
if country.attributes['NAME_LONG'] in ['Russian Federation', 'China', 'India', 'Kazakhstan', 'Mongolia',
'France', 'Germany', 'Spain', 'Ukraine', 'Turkey', 'Sweden',
'Finland', 'Denmark', 'Greece', 'Poland', 'Belarus', 'Norway',
'Italy', 'Iran', 'Pakistan', 'Afghanistan', 'Iraq', 'Bulgaria',
'Romania', 'Turkmenistan', 'Uzbekistan', 'Austria', 'Ireland',
'United Kingdom', 'Saudi Arabia', 'Hungary']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('Eurasia.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
if country.attributes['NAME_LONG'] in ['Libya', 'Algeria', 'Niger', 'Morocco', 'Egypt', 'Sudan', 'Chad',
'Democratic Republic of the Congo', 'Somalia', 'Kenya', 'Ethiopia',
'The Gambia', 'Nigeria', 'Cameroon', 'Ghana', 'Guinea', 'Guinea-Bissau',
'Liberia', 'Sierra Leone', 'Burkina Faso', 'Central African Republic',
'Republic of the Congo', 'Gabon', 'Equatorial Guinea', 'Zambia',
'Malawi', 'Mozambique', 'Angola', 'Burundi', 'South Africa',
'South Sudan', 'Somaliland', 'Uganda', 'Rwanda', 'Zimbabwe', 'Tanzania',
'Botswana', 'Namibia', 'Senegal', 'Mali', 'Mauritania', 'Benin']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('Africa.png')
plt.close()
rng = np.random.RandomState(123)
plt.figure(figsize = (20,15))
N_NorthAmerica = 10000
data_NorthAmerica = imread('NorthAmerica.png')[::-1, :, 0].T
X_NorthAmerica = rng.rand(4 * N_NorthAmerica, 2)
i, j = (X_NorthAmerica * data_NorthAmerica.shape).astype(int).T
X_NorthAmerica = X_NorthAmerica[data_NorthAmerica[i, j] < 1]
X_NorthAmerica = X_NorthAmerica[X_NorthAmerica[:, 1]<0.67]
y_NorthAmerica = np.array(['brown']*X_NorthAmerica.shape[0])
plt.scatter(X_NorthAmerica[:, 0], X_NorthAmerica[:, 1], c = 'brown', s = 50)
N_SouthAmerica = 10000
data_SouthAmerica = imread('SouthAmerica.png')[::-1, :, 0].T
X_SouthAmerica = rng.rand(4 * N_SouthAmerica, 2)
i, j = (X_SouthAmerica * data_SouthAmerica.shape).astype(int).T
X_SouthAmerica = X_SouthAmerica[data_SouthAmerica[i, j] < 1]
y_SouthAmerica = np.array(['red']*X_SouthAmerica.shape[0])
plt.scatter(X_SouthAmerica[:, 0], X_SouthAmerica[:, 1], c = 'red', s = 50)
N_Australia = 10000
data_Australia = imread('Australia.png')[::-1, :, 0].T
X_Australia = rng.rand(4 * N_Australia, 2)
i, j = (X_Australia * data_Australia.shape).astype(int).T
X_Australia = X_Australia[data_Australia[i, j] < 1]
y_Australia = np.array(['darkorange']*X_Australia.shape[0])
plt.scatter(X_Australia[:, 0], X_Australia[:, 1], c = 'darkorange', s = 50)
N_Eurasia = 10000
data_Eurasia = imread('Eurasia.png')[::-1, :, 0].T
X_Eurasia = rng.rand(4 * N_Eurasia, 2)
i, j = (X_Eurasia * data_Eurasia.shape).astype(int).T
X_Eurasia = X_Eurasia[data_Eurasia[i, j] < 1]
X_Eurasia = X_Eurasia[X_Eurasia[:, 0]>0.5]
X_Eurasia = X_Eurasia[X_Eurasia[:, 1]<0.67]
y_Eurasia = np.array(['blue']*X_Eurasia.shape[0])
plt.scatter(X_Eurasia[:, 0], X_Eurasia[:, 1], c = 'blue', s = 50)
N_Africa = 10000
data_Africa = imread('Africa.png')[::-1, :, 0].T
X_Africa = rng.rand(4 * N_Africa, 2)
i, j = (X_Africa * data_Africa.shape).astype(int).T
X_Africa = X_Africa[data_Africa[i, j] < 1]
y_Africa = np.array(['darkgreen']*X_Africa.shape[0])
plt.scatter(X_Africa[:, 0], X_Africa[:, 1], c = 'darkgreen', s = 50)
X = np.vstack((X_NorthAmerica, X_SouthAmerica, X_Australia, X_Eurasia, X_Africa))
y = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa))
print(X.shape)
print(y.shape)
plt.show()
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
import warnings
warnings.filterwarnings("ignore")
X_train = X
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = y
print("\nDimensions of the data set: ")
print(X_train.shape, y_train.shape)
dist = np.square(euclidean_distances(X_train, X_train))
plt.figure(figsize = (20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("HISTOGRAM OF EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
def prob_high_dim(sigma, dist_row):
exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
exp_distance[dist_row] = 0
prob_not_symmetr = exp_distance / np.sum(exp_distance)
return prob_not_symmetr
def perplexity(prob):
return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))
def sigma_binary_search(perp_of_sigma, fixed_perplexity):
sigma_lower_limit = 0; sigma_upper_limit = 1000
for i in range(20):
approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
if perp_of_sigma(approx_sigma) < fixed_perplexity:
sigma_lower_limit = approx_sigma
else:
sigma_upper_limit = approx_sigma
if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
break
return approx_sigma
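As a quick sanity check of the binary search logic above, here is a hedged, self-contained toy version (all names are new, so the notebook's dist and prob globals stay untouched) confirming that the search recovers a sigma whose resulting probability row has the requested perplexity:

```python
import numpy as np

def p_row(sq_dists, row_idx, sigma):
    # Gaussian kernel row, point itself excluded, then normalized.
    e = np.exp(-sq_dists / (2 * sigma**2))
    e[row_idx] = 0
    return e / e.sum()

def perp(prob):
    prob = prob[prob > 0]
    return 2 ** (-np.sum(prob * np.log2(prob)))

def search_sigma(perp_of_sigma, target, lo=0.0, hi=1000.0, iters=40):
    # Perplexity grows monotonically with sigma, so plain bisection works.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if perp_of_sigma(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

rng = np.random.RandomState(0)
pts = rng.rand(50, 2)                                   # toy point cloud
sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # squared distances
s = search_sigma(lambda sig: perp(p_row(sq[0], 0, sig)), target=20)
print(round(perp(p_row(sq[0], 0, s)), 2))               # should be ≈ 20
```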
my_perp = []; my_sigma_tSNE = []
for PERPLEXITY in range(3, X_train.shape[0], 200):
prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
binary_search_result = sigma_binary_search(func, PERPLEXITY)
prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
sigma_array.append(binary_search_result)
print("Perplexity = {0}, Mean Sigma = {1}".format(PERPLEXITY, np.mean(sigma_array)))
my_perp.append(PERPLEXITY)
my_sigma_tSNE.append(np.mean(sigma_array))
plt.figure(figsize = (20,15))
plt.plot(my_perp, my_sigma_tSNE, '-o')
plt.title("tSNE: Mean Sigma vs. Perplexity", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
plt.figure(figsize = (20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("tSNE: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
plt.figure(figsize = (20,15))
sns.distplot(sigma_array)
plt.title("tSNE: Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
We can see that the mean sigma vs. perplexity dependence for the synthetic World Map data set is qualitatively very similar to the one for the CAFs scRNAseq data set, even though the sample size is now three times larger. We also notice again that the high-dimensional probabilities never get close to one, even at perplexity values approaching the sample size of the data set.
Now let us check how UMAP's mean sigma vs. n_neighbors behaves for the synthetic World Map data set. Here I discovered that at UMAP's default hyperparameters local_connectivity = 1 and bandwidth = 1, the mean sigma is almost constant and takes very low values on the order of ~$10^{-5}$. Thus, in order to compare it with tSNE's mean sigma vs. perplexity, I had to increase the bandwidth hyperparameter from the default 1 to 200.
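Before running the full computation, a hedged toy sketch (synthetic distances, not the World Map data) shows why bandwidth has this effect: since k(sigma) = 2^(sum(p)/bandwidth), hitting the same target k at bandwidth = 200 requires the sum of probabilities, and hence sigma, to be much larger:

```python
import numpy as np

d = np.sort(np.random.RandomState(1).rand(500)) ** 2   # toy squared distances
rho = d[0]                                             # nearest-neighbor distance

def k_of_sigma(sigma, bandwidth):
    p = np.exp(-np.maximum(d - rho, 0) / sigma)
    return 2 ** (p.sum() / bandwidth)

def find_sigma(target_k, bandwidth, lo=1e-8, hi=1e6, iters=60):
    # k grows monotonically with sigma, so bisection applies.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if k_of_sigma(mid, bandwidth) < target_k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

sigmas = {bw: find_sigma(target_k=4, bandwidth=bw) for bw in [1, 200]}
print(sigmas)   # sigma for bandwidth = 200 is orders of magnitude larger
```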
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
import warnings
warnings.filterwarnings("ignore")
X_train = X
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = y
print("\nDimensions of the data set: ")
print(X_train.shape, y_train.shape)
BANDWIDTH = 200
dist = np.square(euclidean_distances(X_train, X_train))
rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]
plt.figure(figsize = (20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("HISTOGRAM OF EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
def prob_high_dim(sigma, dist_row):
d = dist[dist_row] - rho[dist_row]
d[d < 0] = 0
return np.exp(- d / sigma)
def k(prob):
return np.power(2, np.sum(prob) / BANDWIDTH)
def sigma_binary_search(k_of_sigma, fixed_k):
sigma_lower_limit = 0; sigma_upper_limit = 1000
for i in range(20):
approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
if k_of_sigma(approx_sigma) < fixed_k:
sigma_lower_limit = approx_sigma
else:
sigma_upper_limit = approx_sigma
if np.abs(fixed_k - k_of_sigma(approx_sigma)) <= 1e-5:
break
return approx_sigma
my_n_neighbor = []; my_sigma_umap = []
for N_NEIGHBOR in range(3, X_train.shape[0], 200):
prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
func = lambda sigma: k(prob_high_dim(sigma, dist_row))
binary_search_result = sigma_binary_search(func, N_NEIGHBOR)
prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
sigma_array.append(binary_search_result)
print("N_neighbor = {0}, Mean Sigma = {1}".format(N_NEIGHBOR, np.mean(sigma_array)))
my_n_neighbor.append(N_NEIGHBOR)
my_sigma_umap.append(np.mean(sigma_array))
plt.figure(figsize = (20,15))
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.title("UMAP: Mean Sigma vs. N_neighbor", fontsize = 20)
plt.xlabel("N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
plt.figure(figsize = (20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("UMAP: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
plt.figure(figsize = (20,15))
sns.distplot(sigma_array)
plt.title("UMAP: Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
Note that the high-dimensional probabilities have their mode around 1, a dramatic difference from tSNE, where the high-dimensional probabilities are never close to 1. This is probably not only the effect of the large n_neighbors hyperparameter but also of the bandwidth, which multiplies log2(n_neighbors) and thus effectively increases the n_neighbors hyperparameter. In a sense, bandwidth resembles the empirical early exaggeration hyperparameter of tSNE: by effectively increasing the n_neighbors value it scales up the high-dimensional probabilities by some factor, the same way early exaggeration scales up the high-dimensional probabilities for tSNE.
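The mechanism can be sketched with toy numbers (synthetic distances, not the World Map data): because UMAP's probabilities exp(-(d - rho)/sigma) are not normalized over the neighbors, they all drift towards 1 as sigma grows, which is exactly what a large bandwidth forces the binary search to produce and why the histogram above has its mode near 1.

```python
import numpy as np

d = np.random.RandomState(3).rand(1000) ** 2   # toy squared distances
rho = d.min()
means = []
for sigma in [0.01, 1, 100]:
    p = np.exp(-np.maximum(d - rho, 0) / sigma)  # unnormalized, unlike tSNE
    means.append(p.mean())
    print(sigma, round(p.mean(), 3))             # mean probability -> 1
```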
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/bandwidth.png', width=2000)
from umap import umap_
plt.figure(figsize = (20, 15))
my_n_neighbors = []; my_sigma_umap = []
for n_neighbors in range(3, X_train.shape[0], 200):
sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = n_neighbors,
local_connectivity = 1, bandwidth = 200)
my_sigma_umap.append(np.mean(sigmas_umap))
my_n_neighbors.append(n_neighbors)
print("N_neighbor = {0}, Mean Sigma = {1}".format(n_neighbors, np.mean(sigmas_umap)))
plt.plot(my_n_neighbors, my_sigma_umap, '-o')
plt.title("Sigma vs. N_neighbors: UMAP implementation of Leland McInnes", fontsize = 20)
plt.xlabel("N_NEIGHBORS", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
Now we can plot mean sigma vs. perplexity (tSNE) and vs. n_neighbors (UMAP) against each other. Here we need to take into account one catch: sigma enters the exponent to the first power in the equation for the high-dimensional probabilities of UMAP, but to the second power, i.e. as $2\sigma^2$, in the denominator of the exponent for tSNE. Since the goal of this analysis is to compare how much the denominator of the exponent is affected by increasing the number of nearest neighbors in the neighborhood graph (for tSNE and UMAP, respectively), we need to compare $2\sigma^2$, the tSNE denominator of the exponent, against $\sigma$ for UMAP.
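For reference, the two high-dimensional similarity kernels as defined in the prob_high_dim functions above (note that dist already holds squared Euclidean distances in both cases) are

$$p_{j|i}^{\rm tSNE} = \frac{\exp\left(-d_{ij}^2 / (2\sigma_i^2)\right)}{\sum_{k \neq i} \exp\left(-d_{ik}^2 / (2\sigma_i^2)\right)}, \qquad p_{ij}^{\rm UMAP} = \exp\left(-\frac{\max(0,\, d_{ij}^2 - \rho_i)}{\sigma_i}\right),$$

so the denominator of the exponent is $2\sigma_i^2$ for tSNE but only $\sigma_i$ for UMAP, which is why we rescale the tSNE curve before overlaying the two.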
my_sigma_tSNE_mod = [2*(i**2) for i in my_sigma_tSNE]
plt.figure(figsize=(20, 15))
plt.plot(my_perp, my_sigma_tSNE_mod, '-o')
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.gca().legend(('tSNE','UMAP'), fontsize = 20)
plt.title("Sigma vs. Perplexity / N_Neighbors for tSNE / UMAP", fontsize = 20)
plt.xlabel("PERPLEXITY / N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
Increasing the bandwidth hyperparameter helped to put UMAP's and tSNE's mean sigma on the same scale for the World Map data set; however, the main conclusion about the logarithmic growth of mean sigma as a function of n_neighbors remains the same, i.e. UMAP's mean sigma is less sensitive to increasing n_neighbors than tSNE's mean sigma is to perplexity. UMAP's mean sigma does not diverge hyperbolically when n_neighbors approaches the sample size of the data set, in contrast to tSNE's mean sigma, which jumps to infinity at perplexity equal to the sample size.
Let us now see how the above mean sigma vs. n_neighbors / perplexity dependence looks for the World Map embedded into the 3D Swiss Roll non-linear manifold.
z_3d = X[:, 1]
x_3d = X[:, 0] * np.cos(X[:, 0]*10)
y_3d = X[:, 0] * np.sin(X[:, 0]*10)
X_swiss_roll = np.array([x_3d, y_3d, z_3d]).T
X_swiss_roll.shape
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
import warnings
warnings.filterwarnings("ignore")
X_train = X_swiss_roll
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = y
print("\nDimensions of the data set: ")
print(X_train.shape, y_train.shape)
dist = np.square(euclidean_distances(X_train, X_train))
plt.figure(figsize = (20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("HISTOGRAM OF EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
def prob_high_dim(sigma, dist_row):
exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
exp_distance[dist_row] = 0
prob_not_symmetr = exp_distance / np.sum(exp_distance)
return prob_not_symmetr
def perplexity(prob):
return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))
def sigma_binary_search(perp_of_sigma, fixed_perplexity):
sigma_lower_limit = 0; sigma_upper_limit = 1000
for i in range(20):
approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
if perp_of_sigma(approx_sigma) < fixed_perplexity:
sigma_lower_limit = approx_sigma
else:
sigma_upper_limit = approx_sigma
if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
break
return approx_sigma
my_perp = []; my_sigma_tSNE = []
for PERPLEXITY in range(3, X_train.shape[0], 200):
prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
binary_search_result = sigma_binary_search(func, PERPLEXITY)
prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
sigma_array.append(binary_search_result)
print("Perplexity = {0}, Mean Sigma = {1}".format(PERPLEXITY, np.mean(sigma_array)))
my_perp.append(PERPLEXITY)
my_sigma_tSNE.append(np.mean(sigma_array))
plt.figure(figsize = (20,15))
plt.plot(my_perp, my_sigma_tSNE, '-o')
plt.title("tSNE: Mean Sigma vs. Perplexity", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
plt.figure(figsize = (20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("tSNE: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
plt.figure(figsize = (20,15))
sns.distplot(sigma_array)
plt.title("tSNE: Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
Now let us see how mean sigma vs. n_neighbors looks for UMAP on the Swiss Roll embedded 3D data set:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
import warnings
warnings.filterwarnings("ignore")
X_train = X_swiss_roll
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = y
print("\nDimensions of the data set: ")
print(X_train.shape, y_train.shape)
BANDWIDTH = 200
dist = np.square(euclidean_distances(X_train, X_train))
rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]
plt.figure(figsize = (20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("HISTOGRAM OF EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
def prob_high_dim(sigma, dist_row):
d = dist[dist_row] - rho[dist_row]
d[d < 0] = 0
return np.exp(- d / sigma)
def k(prob):
return np.power(2, np.sum(prob) / BANDWIDTH)
def sigma_binary_search(k_of_sigma, fixed_k):
sigma_lower_limit = 0; sigma_upper_limit = 1000
for i in range(20):
approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
if k_of_sigma(approx_sigma) < fixed_k:
sigma_lower_limit = approx_sigma
else:
sigma_upper_limit = approx_sigma
if np.abs(fixed_k - k_of_sigma(approx_sigma)) <= 1e-5:
break
return approx_sigma
my_n_neighbor = []; my_sigma_umap = []
for N_NEIGHBOR in range(3, X_train.shape[0], 200):
prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
func = lambda sigma: k(prob_high_dim(sigma, dist_row))
binary_search_result = sigma_binary_search(func, N_NEIGHBOR)
prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
sigma_array.append(binary_search_result)
print("N_neighbor = {0}, Mean Sigma = {1}".format(N_NEIGHBOR, np.mean(sigma_array)))
my_n_neighbor.append(N_NEIGHBOR)
my_sigma_umap.append(np.mean(sigma_array))
plt.figure(figsize = (20,15))
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.title("UMAP: Mean Sigma vs. N_neighbor", fontsize = 20)
plt.xlabel("N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
plt.figure(figsize = (20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("UMAP: HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
plt.figure(figsize = (20,15))
sns.distplot(sigma_array)
plt.title("UMAP: Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
my_sigma_tSNE_mod = [2*(i**2) for i in my_sigma_tSNE]
plt.figure(figsize=(20, 15))
plt.plot(my_perp, my_sigma_tSNE_mod, '-o')
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.gca().legend(('tSNE','UMAP'), fontsize = 20)
plt.title("Sigma vs. Perplexity / N_Neighbors for tSNE / UMAP", fontsize = 20)
plt.xlabel("PERPLEXITY / N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
Again we see much the same for the 3D data as for the 2D: the non-linear embedding into the Swiss Roll did not seem to change much in terms of the mean sigma vs. perplexity / n_neighbors dependence. We again had to use the bandwidth = 200 hyperparameter in order to compare mean sigma for tSNE with mean sigma for UMAP on a similar scale.
Here we will try to understand the limit of high perplexity for tSNE and high n_neighbors for UMAP. More specifically, we will investigate how much the gradients in the gradient descent change the initialization coordinates. For this purpose we again use the synthetic World Map data set, where we know the ground truth, i.e. how far from each other and in which order the continents are located on the map. We embed the World Map into a 3D non-linear manifold, the Swiss Roll, and compare tSNE vs. UMAP on the quality of the reconstruction of the original data.
import cartopy
import numpy as np
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import cartopy.feature as cfeature
from skimage.io import imread
import cartopy.io.shapereader as shpreader
shapename = 'admin_0_countries'
countries_shp = shpreader.natural_earth(resolution='110m',
category='cultural', name=shapename)
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
#print(country.attributes['NAME_LONG'])
if country.attributes['NAME_LONG'] in ['United States','Canada', 'Mexico']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('NorthAmerica.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
if country.attributes['NAME_LONG'] in ['Brazil','Argentina', 'Peru', 'Uruguay', 'Venezuela',
'Bolivia', 'Colombia', 'Ecuador', 'Paraguay']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('SouthAmerica.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
if country.attributes['NAME_LONG'] in ['Australia']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('Australia.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
if country.attributes['NAME_LONG'] in ['Russian Federation', 'China', 'India', 'Kazakhstan', 'Mongolia',
'France', 'Germany', 'Spain', 'Ukraine', 'Turkey', 'Sweden',
'Finland', 'Denmark', 'Greece', 'Poland', 'Belarus', 'Norway',
'Italy', 'Iran', 'Pakistan', 'Afghanistan', 'Iraq', 'Bulgaria',
'Romania', 'Turkmenistan', 'Uzbekistan', 'Austria', 'Ireland',
'United Kingdom', 'Saudi Arabia', 'Hungary']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('Eurasia.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
if country.attributes['NAME_LONG'] in ['Libya', 'Algeria', 'Niger', 'Morocco', 'Egypt', 'Sudan', 'Chad',
'Democratic Republic of the Congo', 'Somalia', 'Kenya', 'Ethiopia',
'The Gambia', 'Nigeria', 'Cameroon', 'Ghana', 'Guinea', 'Guinea-Bissau',
'Liberia', 'Sierra Leone', 'Burkina Faso', 'Central African Republic',
'Republic of the Congo', 'Gabon', 'Equatorial Guinea', 'Zambia',
'Malawi', 'Mozambique', 'Angola', 'Burundi', 'South Africa',
'South Sudan', 'Somaliland', 'Uganda', 'Rwanda', 'Zimbabwe', 'Tanzania',
'Botswana', 'Namibia', 'Senegal', 'Mali', 'Mauritania', 'Benin']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('Africa.png')
plt.close()
rng = np.random.RandomState(123)
plt.figure(figsize = (20,15))
N_NorthAmerica = 10000
data_NorthAmerica = imread('NorthAmerica.png')[::-1, :, 0].T
X_NorthAmerica = rng.rand(4 * N_NorthAmerica, 2)
i, j = (X_NorthAmerica * data_NorthAmerica.shape).astype(int).T
X_NorthAmerica = X_NorthAmerica[data_NorthAmerica[i, j] < 1]
X_NorthAmerica = X_NorthAmerica[X_NorthAmerica[:, 1]<0.67]
y_NorthAmerica = np.array(['brown']*X_NorthAmerica.shape[0])
plt.scatter(X_NorthAmerica[:, 0], X_NorthAmerica[:, 1], c = 'brown', s = 50)
N_SouthAmerica = 10000
data_SouthAmerica = imread('SouthAmerica.png')[::-1, :, 0].T
X_SouthAmerica = rng.rand(4 * N_SouthAmerica, 2)
i, j = (X_SouthAmerica * data_SouthAmerica.shape).astype(int).T
X_SouthAmerica = X_SouthAmerica[data_SouthAmerica[i, j] < 1]
y_SouthAmerica = np.array(['red']*X_SouthAmerica.shape[0])
plt.scatter(X_SouthAmerica[:, 0], X_SouthAmerica[:, 1], c = 'red', s = 50)
N_Australia = 10000
data_Australia = imread('Australia.png')[::-1, :, 0].T
X_Australia = rng.rand(4 * N_Australia, 2)
i, j = (X_Australia * data_Australia.shape).astype(int).T
X_Australia = X_Australia[data_Australia[i, j] < 1]
y_Australia = np.array(['darkorange']*X_Australia.shape[0])
plt.scatter(X_Australia[:, 0], X_Australia[:, 1], c = 'darkorange', s = 50)
N_Eurasia = 10000
data_Eurasia = imread('Eurasia.png')[::-1, :, 0].T
X_Eurasia = rng.rand(4 * N_Eurasia, 2)
i, j = (X_Eurasia * data_Eurasia.shape).astype(int).T
X_Eurasia = X_Eurasia[data_Eurasia[i, j] < 1]
X_Eurasia = X_Eurasia[X_Eurasia[:, 0]>0.5]
X_Eurasia = X_Eurasia[X_Eurasia[:, 1]<0.67]
y_Eurasia = np.array(['blue']*X_Eurasia.shape[0])
plt.scatter(X_Eurasia[:, 0], X_Eurasia[:, 1], c = 'blue', s = 50)
N_Africa = 10000
data_Africa = imread('Africa.png')[::-1, :, 0].T
X_Africa = rng.rand(4 * N_Africa, 2)
i, j = (X_Africa * data_Africa.shape).astype(int).T
X_Africa = X_Africa[data_Africa[i, j] < 1]
y_Africa = np.array(['darkgreen']*X_Africa.shape[0])
plt.scatter(X_Africa[:, 0], X_Africa[:, 1], c = 'darkgreen', s = 50)
plt.show()
X = np.vstack((X_NorthAmerica, X_SouthAmerica, X_Australia, X_Eurasia, X_Africa))
y = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa))
print(X.shape)
print(y.shape)
Here we create the 3D Swiss Roll embedding of the World Map data points. The intrinsic dimensionality of the data is still 2, therefore a good dimension reduction algorithm should be able to reconstruct the original World Map.
z_3d = X[:, 1]
x_3d = X[:, 0] * np.cos(X[:, 0]*10)
y_3d = X[:, 0] * np.sin(X[:, 0]*10)
X_swiss_roll = np.array([x_3d, y_3d, z_3d]).T
X_swiss_roll.shape
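Before running tSNE, we can convince ourselves that this mapping loses no information, i.e. the intrinsic dimensionality really stays 2. Since $x^2 + y^2 = t^2$, the radius recovers the first coordinate exactly, and z carries the second one unchanged; here is a hedged toy check with nonnegative t mimicking X[:, 0]:

```python
import numpy as np

rng = np.random.RandomState(42)
t = rng.rand(1000)                  # stand-in for X[:, 0], nonnegative
h = rng.rand(1000)                  # stand-in for X[:, 1]
x3 = t * np.cos(t * 10)             # same roll as for X_swiss_roll
y3 = t * np.sin(t * 10)
z3 = h
t_back = np.sqrt(x3**2 + y3**2)     # invert the roll: radius = |t| = t
print(np.allclose(t_back, t), np.allclose(z3, h))   # → True True
```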
Let us test how well tSNE can reconstruct the original 2D World Map embedded into the 3D Swiss Roll. For this purpose, we will gradually increase the Perplexity value in order to check the hypothesis that tSNE is capable of reconstructing the original data at large enough perplexities.
import matplotlib
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")
figure = plt.figure(figsize = (20, 15))
matplotlib.rcParams.update({'font.size': 22})
X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
for perplexity_index, perplexity in enumerate([50, 500, 1000, 2000]):
print('Performing tSNE for Perplexity = {}'.format(perplexity))
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = perplexity,
init = X_swiss_roll_reduced, n_iter = 1000, verbose = 0)
tsne = model.fit_transform(X_swiss_roll)
plt.subplot(221 + perplexity_index)
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE: Perplexity = {}'.format(perplexity), fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
figure.tight_layout()
plt.show()
Contrary to the hypothesis, we can see that increasing the perplexity actually decreases the quality of the reconstruction of the original data. At perplexity = 50, the World Map looks distorted but fair enough, while at perplexities 500 and 1000 the World Map becomes unreasonably elongated and the order of the continents is no longer preserved. The most astonishing picture we observe at perplexity = 2000: here the World Map looks like an Archimedean spiral, i.e. very similar to the PCA reconstruction. The algorithm obviously has problems with convergence; however, even increasing the learning rate and the number of iterations did not help at all (you are welcome to check it). Here we confirm our earlier suspicion that the gradients of the tSNE algorithm become close to zero at large perplexity, so the algorithm does not really improve on the original PCA initialization. So if one runs tSNE with a PCA initialization and increases the perplexity, one ends up with nothing other than PCA. Let us check whether UMAP can do better at large values of the n_neighbors hyperparameter.
import matplotlib
from umap import UMAP
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")
figure = plt.figure(figsize = (20, 15))
matplotlib.rcParams.update({'font.size': 22})
X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
for n_neighbors_index, n_neighbors in enumerate([50, 500, 1000, 2000]):
print('Performing UMAP for n_neighbors = {}'.format(n_neighbors))
model = UMAP(learning_rate = 1, n_components = 2, min_dist = 2, n_neighbors = n_neighbors,
init = X_swiss_roll_reduced, n_epochs = 1000, verbose = 0, spread = 2)
umap = model.fit_transform(X_swiss_roll)
plt.subplot(221 + n_neighbors_index)
plt.scatter(umap[:, 0], umap[:, 1], c = y, s = 50)
plt.title('UMAP: n_neighbors = {}'.format(n_neighbors), fontsize = 20)
plt.xlabel("UMAP1", fontsize = 20); plt.ylabel("UMAP2", fontsize = 20)
figure.tight_layout()
plt.show()
Here we can see that UMAP is not particularly sensitive to the n_neighbors hyperparameter value: the UMAP visualizations for n_neighbors = 50, 500, 1000 and 2000 are fairly comparable. This again confirms what we have learnt about the logarithmic $\sigma(\rm{n\_neighbor})$ dependence for UMAP, i.e. sigma does not reach large values even if one dramatically increases n_neighbors.
Let us plot how the gradient of tSNE changes with increasing perplexity and show that it becomes negligible compared to the contribution of the initialization to the gradient descent update. We will use only the gradients at the beginning of the optimization, as this is where the major changes happen; for simplicity we take the gradient after 50 iterations.
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
X_swiss_roll_reduced = PCA(n_components = 2).fit_transform(X_swiss_roll)
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 750,
init = X_swiss_roll_reduced, n_iter = 1000, verbose = 2)
tsne = model.fit_transform(X_swiss_roll)
plt.figure(figsize = (20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = y, s = 50)
plt.title('tSNE', fontsize = 20); plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
tsne_perplexity = [3, 10, 20, 30, 50, 80, 100, 200, 500, 750, 1000, 1500, 2000]
tsne_perplexity
tsne_gradient = [0.0785314, 0.0414363, 0.0267256, 0.0240762, 0.0189936, 0.0184264, 0.0127485, 0.0105595,
0.0070886, 0.0040186, 0.0004717, 0.0000006, 0.0000004]
tsne_gradient
plt.figure(figsize = (20,15))
plt.plot(tsne_perplexity, tsne_gradient, "-o")
plt.title("tSNE Gradient vs. Perplexity", fontsize = 22)
plt.xlabel("Perplexity", fontsize = 22); plt.ylabel("tSNE Gradient", fontsize = 22)
plt.show()
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances
N_LOW_DIMS = 2
MAX_ITER = 200
PERPLEXITY = 100
LEARNING_RATE = 0.1
path = '/home/nikolay/WABI/K_Pietras/Manifold_Learning/'
expr = pd.read_csv(path + 'bartoschek_filtered_expr_rpkm.txt', sep='\t')
print(expr.iloc[0:4,0:4])
X_train = expr.values[:,0:(expr.shape[1]-1)]; X_train = np.log(X_train + 1); n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the data set: ")
print(X_train.shape, y_train.shape)
print('\n')
dist = np.square(euclidean_distances(X_train, X_train))
X_reduced = PCA(n_components = 2).fit_transform(X_train)
def prob_high_dim(sigma, dist_row):
exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
exp_distance[dist_row] = 0
prob_not_symmetr = exp_distance / np.sum(exp_distance)
return prob_not_symmetr
def perplexity(prob):
return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))
def sigma_binary_search(perp_of_sigma, fixed_perplexity):
sigma_lower_limit = 0; sigma_upper_limit = 1000
for i in range(20):
approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
if perp_of_sigma(approx_sigma) < fixed_perplexity:
sigma_lower_limit = approx_sigma
else:
sigma_upper_limit = approx_sigma
if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
break
return approx_sigma
prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
binary_search_result = sigma_binary_search(func, PERPLEXITY)
prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
sigma_array.append(binary_search_result)
if (dist_row + 1) % 100 == 0:
print("Sigma binary search finished {0} of {1} cells".format(dist_row + 1, n))
print("\nMean sigma = " + str(np.mean(sigma_array)))
P = prob + np.transpose(prob)
def prob_low_dim(Y):
inv_distances = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
np.fill_diagonal(inv_distances, 0.)
return inv_distances / np.sum(inv_distances, axis = 1, keepdims = True)
def KL(P, Y):
Q = prob_low_dim(Y)
return P * np.log(P + 0.01) - P * np.log(Q + 0.01)
def KL_gradient(P, Y):
Q = prob_low_dim(Y)
y_diff = np.expand_dims(Y, 1) - np.expand_dims(Y, 0)
inv_dist = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
return 4*np.sum(np.expand_dims(P - Q, 2) * y_diff * np.expand_dims(inv_dist, 2), axis = 1)
np.random.seed(12345)
#y = np.random.normal(loc = 0, scale = 1, size = (n, N_LOW_DIMS))
y = X_reduced
KL_array = []; KL_gradient_array = []
print("Running Gradient Descent: \n")
for i in range(MAX_ITER):
y = y - LEARNING_RATE * KL_gradient(P, y)
KL_array.append(np.sum(KL(P, y)))
KL_gradient_array.append(np.sum(KL_gradient(P, y)))
if i % 100 == 0:
print("KL divergence = " + str(np.sum(KL(P, y))))
plt.figure(figsize=(20,15))
plt.plot(KL_array,'-o')
plt.title("KL-divergence", fontsize = 20)
plt.xlabel("ITERATION", fontsize = 20); plt.ylabel("KL-DIVERGENCE", fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
plt.plot(KL_gradient_array,'-o')
plt.title("KL-divergence Gradient", fontsize = 20)
plt.xlabel("ITERATION", fontsize = 20); plt.ylabel("KL-DIVERGENCE GRADIENT", fontsize = 20)
plt.show()
I am a bit worried that the mean sigma reported by the scikit-learn implementation of tSNE is slightly higher than the one determined by my implementation. To understand whether I have a bug in my code, I will dig into the scikit-learn and Rtsne code. First let us run my implementation on the Cancer Associated Fibroblasts (CAFs) scRNAseq data set:
import numpy as np; import pandas as pd; import seaborn as sns
from sklearn.metrics.pairwise import euclidean_distances; import matplotlib.pyplot as plt
path = '/home/nikolay/WABI/K_Pietras/Manifold_Learning/'
expr = pd.read_csv(path + 'bartoschek_filtered_expr_rpkm.txt', sep='\t')
print(expr.iloc[0:4,0:4])
X_train = expr.values[:,0:(expr.shape[1]-1)]; X_train = np.log(X_train + 1); n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the data set: ")
print(X_train.shape, y_train.shape)
dist = np.square(euclidean_distances(X_train, X_train))
def prob_high_dim(sigma, dist_row):
exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
exp_distance[dist_row] = 0
prob_not_symmetr = exp_distance / np.sum(exp_distance)
return prob_not_symmetr
def perplexity(prob):
return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))
def sigma_binary_search(perp_of_sigma, fixed_perplexity):
sigma_lower_limit = 0; sigma_upper_limit = 1000
for i in range(20):
approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
if perp_of_sigma(approx_sigma) < fixed_perplexity:
sigma_lower_limit = approx_sigma
else:
sigma_upper_limit = approx_sigma
if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
break
return approx_sigma
PERPLEXITY = 30
prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
binary_search_result = sigma_binary_search(func, PERPLEXITY)
prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
sigma_array.append(binary_search_result)
print("Perplexity = {0}, Mean Sigma = {1}".format(PERPLEXITY, np.mean(sigma_array)))
As we can see, the mean sigma parameter is equal to 6.37 for Perplexity = 30. However, when we run the scikit-learn implementation of tSNE we get a mean sigma equal to 7.54 for the same perplexity, see below:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 2).fit_transform(X_train)
model = TSNE(learning_rate = 200, n_components = 2, random_state = 123, perplexity = 30,
init = X_reduced, n_iter = 1000, method = 'exact', verbose = 2)
tsne = model.fit_transform(X_train)
This could be, for example, because I calculate the Euclidean distances in a wrong way, so let us see how scikit-learn calculates pairwise Euclidean distances:
from sklearn.metrics.pairwise import pairwise_distances
dist_X = pairwise_distances(X_train, metric = 'euclidean', squared = True)
dist_X
And now compare with the pairwise Euclidean distances computed in my code:
dist
The pairwise Euclidean distances look identical, so this is not where the discrepancy comes from. I figured out that scikit-learn calculates mean sigma via the binary_search_perplexity function from the _utils module. We can quickly reproduce the mean sigma of 7.54 without running the whole tSNE procedure:
from sklearn.manifold import _utils
conditional_P = _utils._binary_search_perplexity(np.float32(dist_X), desired_perplexity = 30, verbose = 2)
I also looked at the code of the binary_search_perplexity function from here and also checked the C++ implementation used by the Rtsne wrapper from here. There are a few interesting discrepancies from my implementation of computing mean sigma.
First of all, scikit-learn and Rtsne use a beta parameter instead of sigma, where $\beta_i = 1 / (2\sigma_i^2)$, therefore $\sigma_i = \sqrt{1/(2\beta_i)}$ and $<\sigma> = \sqrt{1/(2<\beta>)}$, where $<\beta>=(1/N)\sum_i{\beta_i}\equiv (1/N)\beta_{\rm{sum}}$. A simple derivation then gives $<\sigma> = \sqrt{1/(2<\beta>)}=\sqrt{N/(2\beta_{\rm{sum}})}$. However, we see that the coefficient 2 is ignored in the scikit-learn code below at the very end. Moreover, the definition of entropy is $H=-\sum_j{p_{ij}\log_2{p_{ij}}}$, yet the base 2 of the logarithm is ignored in the scikit-learn implementation and math.log, i.e. base 2.71, is used instead. In addition, math.log(PERPLEXITY) is used for computing the desired entropy, i.e. again base 2.71 instead of 2, while $\rm{Perplexity}=2^{\rm{entropy}}$ by definition. In summary, we observe quite a few discrepancies between the scikit-learn implementation and the mathematical formulation of tSNE in Laurens van der Maaten and Geoffrey Hinton's original paper, so it looks like my implementation of computing sigma values is actually mathematically closer to the original algorithm than the scikit-learn one.
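The beta-to-sigma conversion can be checked numerically. The sketch below (an illustration with made-up beta values, not scikit-learn's actual code) compares the mathematically exact $<\sigma> = \sqrt{N/(2\beta_{\rm{sum}})}$, following the $<\beta>$-based derivation above, with the scikit-learn-style value $\sqrt{N/\beta_{\rm{sum}}}$ that drops the factor of 2:

```python
import numpy as np

# Hypothetical beta values, beta_i = 1/(2*sigma_i^2), for N = 4 points
betas = np.array([0.02, 0.01, 0.005, 0.04])
N = len(betas)
beta_sum = betas.sum()

# Mathematically exact mean sigma from <beta>: sqrt(N / (2 * beta_sum))
sigma_exact = np.sqrt(N / (2 * beta_sum))

# scikit-learn-style value, ignoring the factor of 2: sqrt(N / beta_sum)
sigma_sklearn_style = np.sqrt(N / beta_sum)

print(sigma_exact, sigma_sklearn_style)
print(sigma_sklearn_style / sigma_exact)  # the two differ by exactly sqrt(2)
```

Dropping the factor of 2 therefore inflates the reported mean sigma by a constant factor of $\sqrt{2}\approx 1.41$, which goes in the direction of the discrepancy observed above.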
import math
my_perp = []; my_sigma = []
for PERPLEXITY in range(3, X_train.shape[0], 20):
n_steps = 100; desired_entropy = math.log(PERPLEXITY)
sqdistances = np.float32(dist_X)
n_samples = sqdistances.shape[0]; n_neighbors = sqdistances.shape[1]
PERPLEXITY_TOLERANCE = 1e-5; EPSILON_DBL = 1e-8; NPY_INFINITY = 1000
P = np.zeros((n_samples, n_neighbors), dtype=np.float64); beta_sum = 0.0
for i in range(n_samples):
beta_min = -NPY_INFINITY; beta_max = NPY_INFINITY; beta = 1.0
for l in range(n_steps):
sum_Pi = 0.0
for j in range(n_neighbors):
if j != i or n_neighbors < n_samples:
P[i, j] = math.exp(-sqdistances[i, j] * beta)
sum_Pi += P[i, j]
if sum_Pi == 0.0:
sum_Pi = EPSILON_DBL
sum_disti_Pi = 0.0
for j in range(n_neighbors):
P[i, j] /= sum_Pi
sum_disti_Pi += sqdistances[i, j] * P[i, j]
entropy = math.log(sum_Pi) + beta * sum_disti_Pi
entropy_diff = entropy - desired_entropy
if math.fabs(entropy_diff) <= PERPLEXITY_TOLERANCE:
break
if entropy_diff > 0.0:
beta_min = beta
if beta_max == NPY_INFINITY:
beta *= 2.0
else:
beta = (beta + beta_max) / 2.0
else:
beta_max = beta
if beta_min == -NPY_INFINITY:
beta /= 2.0
else:
beta = (beta + beta_min) / 2.0
beta_sum += beta
my_perp.append(PERPLEXITY)
my_sigma.append(np.mean(math.sqrt(n_samples / beta_sum)))
print("Perplexity = {0}, Mean sigma: {1}".format(PERPLEXITY, np.mean(math.sqrt(n_samples / beta_sum))))
plt.figure(figsize=(20, 15))
plt.plot(my_perp, my_sigma_tSNE, '-o')
plt.plot(my_perp, my_sigma, '-o')
plt.gca().legend(('tSNE My Implementation','tSNE Scikitlearn Implementation'), fontsize = 20)
plt.title("Sigma vs. Perplexity for tSNE", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
Comparing how the scikit-learn mean sigma and the mean sigma from my tSNE implementation vary with Perplexity, we conclude that they behave very similarly qualitatively, while the absolute values from the scikit-learn implementation are systematically a bit larger than the ones from my implementation of the tSNE algorithm.
Since mean sigma is not really reported by the original implementation of UMAP from Leland McInnes, I want to dig into the UMAP code and extract mean sigma in order to compare it with the mean sigma from my implementation of UMAP. Looking at the umap_.py script, I found a function smooth_knn_dist that seems to perform the binary search and return arrays of sigma and rho (the local connectivity parameter). The input of the function is the number of nearest neighbors k = n_neighbor for each point.
from umap import umap_
sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = 716, bandwidth = 20)
np.mean(sigmas_umap)
Playing with the parameters of the smooth_knn_dist function, I noticed that the returned mean sigma does not change that much when I vary n_neighbor. However, I discovered another parameter, bandwidth, that seems to have a dramatic effect on the mean sigma. Here I demonstrate that by increasing bandwidth we can in principle make mean sigma go to infinity.
from umap import umap_
plt.figure(figsize=(20, 15))
my_bandwidth_n_neighbors = []
my_bandwidth_sigma_umap = []
for bandwidth in [1, 5, 10, 20]:
my_n_neighbors = []; my_sigma_umap = []
for n_neighbors in range(3, X_train.shape[0], 20):
sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = n_neighbors, bandwidth = bandwidth)
my_sigma_umap.append(np.mean(sigmas_umap))
my_n_neighbors.append(n_neighbors)
#print("UMAP N_neighbors = {0}, UMAP Mean sigma: {1}".format(n_neighbors, np.mean(sigmas_umap)))
my_bandwidth_sigma_umap.append(my_sigma_umap)
my_bandwidth_n_neighbors.append(my_n_neighbors)
print('Finished computing mean sigma for bandwidth = {}'.format(bandwidth))
plt.plot(my_n_neighbors, my_sigma_umap, '-o')
plt.gca().legend(['Bandwidth = {}'.format(bandwidth) for bandwidth in [1, 5, 10, 20]], fontsize = 20)
plt.title("Sigma vs. N_neighbors for UMAP", fontsize = 20)
plt.xlabel("N_NEIGHBORS", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
The bandwidth parameter is connected with entropy and n_neighbors as:
$$\rm{n_{neighbors}} = 2^{\displaystyle\rm{entropy \,/\, bandwidth}}$$Without the bandwidth parameter (or when bandwidth = 1), the perplexity or number of nearest neighbors would be given by:
$$\rm{n_{neighbors}} = 2^{\displaystyle\rm{entropy}} = 2^{\displaystyle\sum_j{p_{ij}}}$$Therefore, the bandwidth parameter effectively scales the entropy, i.e. it multiplies $\log_2{\rm{n_{neighbors}}}$ by bandwidth and hence raises the effective value of the n_neighbors (aka perplexity) parameter.
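The rescaling effect can be made concrete with a toy calculation (illustrative helper functions, not UMAP's actual code): inverting the relation above gives entropy = bandwidth * log2(n_neighbors), so for a fixed target n_neighbors a larger bandwidth demands a larger entropy, which the binary search can only reach by inflating sigma:

```python
import numpy as np

# Sketch of the relation n_neighbors = 2^(entropy / bandwidth) and its inverse
def effective_n_neighbors(entropy, bandwidth=1.0):
    return 2.0 ** (entropy / bandwidth)

def required_entropy(n_neighbors, bandwidth=1.0):
    # Entropy the binary search has to reach for a given target n_neighbors
    return bandwidth * np.log2(n_neighbors)

# For a fixed target of 15 neighbors, a larger bandwidth demands a larger
# entropy, hence a larger sigma at the end of the binary search
for bw in [1, 5, 10, 20]:
    print(bw, required_entropy(15, bw))
```

This matches the plot above, where mean sigma grows strongly with the bandwidth parameter at any fixed n_neighbors.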
Now, looking more carefully, we can see that the smooth_knn_dist function has another very important parameter local_connectivity that seems to affect the mean sigma a lot. Let us demonstrate it:
from umap import umap_
plt.figure(figsize=(20, 15))
my_local_connectivity_n_neighbors = []
my_local_connectivity_sigma_umap = []
for local_connectivity in [0, 0.1, 0.5, 1]:
my_n_neighbors = []; my_sigma_umap = []
for n_neighbors in range(3, X_train.shape[0], 20):
sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = n_neighbors,
local_connectivity = local_connectivity)
my_sigma_umap.append(np.mean(sigmas_umap))
my_n_neighbors.append(n_neighbors)
my_local_connectivity_sigma_umap.append(my_sigma_umap)
my_local_connectivity_n_neighbors.append(my_n_neighbors)
print('Finished computing mean sigma for local_connectivity = {}'.format(local_connectivity))
plt.plot(my_n_neighbors, my_sigma_umap, '-o')
plt.gca().legend(['Local_connectivity = {}'.format(local_connectivity)
for local_connectivity in [0, 0.1, 0.5, 1]], fontsize = 20)
plt.title("Sigma vs. N_neighbors for UMAP", fontsize = 20)
plt.xlabel("N_NEIGHBORS", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
It looks like at local_connectivity = 1, indeed, the mean sigma parameter does not change much as n_neighbor increases. However, when local_connectivity decreases, mean sigma can jump up to 100 or even 400 before plateauing quite quickly, which agrees with the behavior of my implementation of UMAP.
Now let us understand why, at bandwidth = 1 and local_connectivity = 1 (the default UMAP parameters in Leland McInnes's implementation), the mean sigma from Leland's implementation varies very little with n_neighbor, while in my implementation it varies a lot:
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np; import pandas as pd; import seaborn as sns; import matplotlib.pyplot as plt
dist = np.square(euclidean_distances(X_train, X_train))
rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]
def prob_high_dim(sigma, dist_row):
d = dist[dist_row] - rho[dist_row]
d[d < 0] = 0
return np.exp(- d / sigma)
def k(prob):
return np.power(2, np.sum(prob))
def sigma_binary_search(k_of_sigma, fixed_k):
sigma_lower_limit = 0; sigma_upper_limit = 1000
for i in range(20):
approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
if k_of_sigma(approx_sigma) < fixed_k:
sigma_lower_limit = approx_sigma
else:
sigma_upper_limit = approx_sigma
if np.abs(fixed_k - k_of_sigma(approx_sigma)) <= 1e-5:
break
return approx_sigma
my_n_neighbor = []; my_sigma_umap = []
for N_NEIGHBOR in range(3, X_train.shape[0], 20):
prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
func = lambda sigma: k(prob_high_dim(sigma, dist_row))
binary_search_result = sigma_binary_search(func, N_NEIGHBOR)
prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
sigma_array.append(binary_search_result)
print("N_neighbor = {0}, Mean Sigma = {1}".format(N_NEIGHBOR, np.mean(sigma_array)))
my_n_neighbor.append(N_NEIGHBOR)
my_sigma_umap.append(np.mean(sigma_array))
plt.figure(figsize=(20,15))
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.title("UMAP: Mean Sigma vs. N_neighbor", fontsize = 20)
plt.xlabel("N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
One thing that we can see immediately, comparing my binary search implementation with Leland's, is that I seem to compute the parameter rho very differently from the way Leland computes it. Indeed, the distribution of rho values from my implementation is unimodal, while the one from Leland's implementation is bimodal, i.e. it actually resembles the bimodal shape of the distribution of distances.
dist = np.square(euclidean_distances(X_train, X_train))
rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]
plt.figure(figsize=(20,15))
sns.distplot(rho)
plt.show()
from umap import umap_
sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = 30, bandwidth = 1, local_connectivity = 1)
plt.figure(figsize=(20,15))
sns.distplot(rhos_umap)
plt.show()
So let us dig further into Leland's code and understand how he uses the local_connectivity hyperparameter for computing the rho values. Looking at the code [here](https://github.com/lmcinnes/umap/blob/master/umap/umap_.py) we can see that the local_connectivity parameter is effectively the number of nearest neighbors, i.e. only the first nearest neighbor or the first and second nearest neighbors, whose distance is subtracted from the rest of the distances for each data point. However, when determining which point is the nearest neighbor of each data point, Leland McInnes does not order or sort the points by their distances, but takes only the first one in the order they appear in the distance matrix. Here is the piece of code from umap_.py that shows no sorting / ordering of the distances: the parameter rho for each data point gets the zeroth element of the distance matrix, because index - 1 = 0 (since index = local_connectivity = 1), see below:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/UMAP_no_sorting.png', width=2000)
Therefore, if the distance matrix was not pre-sorted, Leland's procedure seems very strange, since my understanding from the original UMAP paper was that the parameter rho reflects the distance from each data point to its first nearest neighbor; some sorting must therefore be used to determine which point is the nearest neighbor of each data point. Let us demonstrate that if I do not sort the neighbors in my UMAP implementation, I get a distribution of rho values identical to Leland's.
rho = [dist[i][dist[i]>0][0] for i in range(dist.shape[0])]
plt.figure(figsize=(20,15))
sns.distplot(rho)
plt.show()
rho = np.zeros(dist.shape[0], dtype=np.float32)
for i in range(dist.shape[0]):
ith_distances = dist[i]
non_zero_dists = ith_distances[ith_distances > 0.0]
rho[i] = non_zero_dists[0]
plt.figure(figsize=(20,15))
sns.distplot(rho)
plt.show()
In addition, in the umap_.py code there is a somewhat mysterious multiplication factor MIN_K_DIST_SCALE = 0.001 that apparently sets a minimal value of sigma (the result variable in the code below): if sigma is too low, it is set to the mean distance multiplied by MIN_K_DIST_SCALE:
from IPython.display import Image
Image('/home/nikolay/Documents/Medium/tSNE_vs_UMAP/UMAP_multiply_constant.png', width=2000)
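In pseudocode, the clamping logic amounts to the following (a minimal sketch; the function name and arguments are illustrative, only the constant comes from umap_.py):

```python
MIN_K_DIST_SCALE = 0.001  # constant from umap_.py

def clamp_sigma(sigma, mean_ith_distances):
    # If the binary search returned a near-zero sigma, floor it at a small
    # fraction of the mean distance for that point, avoiding singularities
    floor = MIN_K_DIST_SCALE * mean_ith_distances
    return max(sigma, floor)

print(clamp_sigma(1e-8, 50.0))  # clamped up to MIN_K_DIST_SCALE * 50.0
print(clamp_sigma(3.2, 50.0))   # left unchanged
```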
Let us omit sorting the data points by their proximity to each data point in my UMAP implementation and compute the mean sigma vs. n_neighbor dependence:
from sklearn.metrics.pairwise import euclidean_distances
import numpy as np; import pandas as pd; import seaborn as sns; import matplotlib.pyplot as plt
dist = np.square(euclidean_distances(X_train, X_train))
mean_distances = np.mean(dist)
#rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]
rho = [dist[i][dist[i]>0][0] for i in range(dist.shape[0])]
def prob_high_dim(sigma, dist_row):
d = dist[dist_row] - rho[dist_row]
d[d < 0] = 0
return np.exp(- d / sigma)
def k(prob):
return np.power(2, np.sum(prob))
def sigma_binary_search(k_of_sigma, fixed_k):
sigma_lower_limit = 0; sigma_upper_limit = 1000
for i in range(20):
approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
if k_of_sigma(approx_sigma) < fixed_k:
sigma_lower_limit = approx_sigma
else:
sigma_upper_limit = approx_sigma
if np.abs(fixed_k - k_of_sigma(approx_sigma)) <= 1e-5:
break
return approx_sigma
my_n_neighbor = []; my_sigma_umap = []
for N_NEIGHBOR in range(3, X_train.shape[0], 20):
prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
mean_ith_distances = np.mean(dist[dist_row])
func = lambda sigma: k(prob_high_dim(sigma, dist_row))
binary_search_result = sigma_binary_search(func, N_NEIGHBOR)
prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
if binary_search_result < mean_ith_distances * 1e-3:
binary_search_result = mean_ith_distances * 1e-3
sigma_array.append(binary_search_result)
print("N_neighbor = {0}, Mean Sigma = {1}".format(N_NEIGHBOR, np.mean(sigma_array)))
my_n_neighbor.append(N_NEIGHBOR)
my_sigma_umap.append(np.mean(sigma_array))
plt.figure(figsize=(20,15))
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.title("UMAP: Mean Sigma vs. N_neighbor", fontsize = 20)
plt.xlabel("N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
The values of mean sigma that we obtain are very close to the ones from the original UMAP binary search implementation for default local_connectivity = 1 and bandwidth = 1:
from umap import umap_
plt.figure(figsize=(20, 15))
my_n_neighbors = []; my_sigma_umap = []
for n_neighbors in range(3, X_train.shape[0], 20):
sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = n_neighbors)
my_sigma_umap.append(np.mean(sigmas_umap))
my_n_neighbors.append(n_neighbors)
print("N_neighbor = {0}, Mean Sigma = {1}".format(n_neighbors, np.mean(sigmas_umap)))
plt.plot(my_n_neighbors, my_sigma_umap, '-o')
plt.title("Sigma vs. N_neighbors for UMAP", fontsize = 20)
plt.xlabel("N_NEIGHBORS", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
Therefore, we conclude that we have managed to reproduce the original Leland McInnes implementation of the binary search for the sigma parameter. The two main discrepancies between my and Leland's computations of sigma were the absence of sorting in Leland's code (which is, again, very strange) and the presence of a kind of minimal sigma that does not allow very small sigmas (probably to avoid singularities) and sets each sigma close to zero to the mean distance multiplied by 0.001.
from umap import umap_
sigmas_umap, rhos_umap = umap_.smooth_knn_dist(dist, k = 703)
np.mean(sigmas_umap)
Now let us try to quantify how well dimension reduction algorithms such as PCA / MDS, tSNE and UMAP are capable of preserving global data structure. By global data structure we mean here: 1) the distances between the clusters, 2) the correlation between the original and transformed centroid coordinates, and 3) the shapes of the clusters. We are going to use a 2D world map synthetic data set, and by shapes of the clusters we mean the sizes of the bounding boxes drawn around each continent / cluster.
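Since the bounding box of each cluster is central to the third criterion, a small helper makes the metric concrete (the function names are hypothetical, just for illustration; they are not part of the analysis code below):

```python
import numpy as np

def bounding_box(X):
    """Return (xmin, xmax, ymin, ymax) for a 2D point cloud X of shape (n, 2)."""
    return (X[:, 0].min(), X[:, 0].max(), X[:, 1].min(), X[:, 1].max())

def box_size(bb):
    # Width and height of the box; comparing them before and after dimension
    # reduction quantifies how well a cluster's shape is preserved
    xmin, xmax, ymin, ymax = bb
    return xmax - xmin, ymax - ymin

pts = np.array([[0.0, 0.0], [2.0, 3.0], [1.0, 1.0]])
print(box_size(bounding_box(pts)))
```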
import cartopy
import numpy as np
import cartopy.crs as ccrs
from skimage.io import imread
import matplotlib.pyplot as plt
import cartopy.feature as cfeature
from matplotlib import pyplot as plt
import cartopy.io.shapereader as shpreader
import warnings
warnings.filterwarnings("ignore")
shapename = 'admin_0_countries'
countries_shp = shpreader.natural_earth(resolution='110m',
category='cultural', name=shapename)
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
#print(country.attributes['NAME_LONG'])
if country.attributes['NAME_LONG'] in ['United States','Canada', 'Mexico']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('NorthAmerica.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
if country.attributes['NAME_LONG'] in ['Brazil','Argentina', 'Peru', 'Uruguay', 'Venezuela',
'Bolivia', 'Colombia', 'Ecuador', 'Paraguay']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('SouthAmerica.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
if country.attributes['NAME_LONG'] in ['Australia']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('Australia.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
if country.attributes['NAME_LONG'] in ['Russian Federation', 'China', 'India', 'Kazakhstan', 'Mongolia',
'France', 'Germany', 'Spain', 'Ukraine', 'Turkey', 'Sweden',
'Finland', 'Denmark', 'Greece', 'Poland', 'Belarus', 'Norway',
'Italy', 'Iran', 'Pakistan', 'Afghanistan', 'Iraq', 'Bulgaria',
'Romania', 'Turkmenistan', 'Uzbekistan', 'Austria', 'Ireland',
'United Kingdom', 'Saudi Arabia', 'Hungary']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('Eurasia.png')
plt.close()
plt.figure(figsize = (20, 15))
ax = plt.axes(projection=ccrs.Miller())
ax.outline_patch.set_visible(False)
ax.set_extent([-180, 180, -50, 70])
for country in shpreader.Reader(countries_shp).records():
if country.attributes['NAME_LONG'] in ['Libya', 'Algeria', 'Niger', 'Morocco', 'Egypt', 'Sudan', 'Chad',
'Democratic Republic of the Congo', 'Somalia', 'Kenya', 'Ethiopia',
'The Gambia', 'Nigeria', 'Cameroon', 'Ghana', 'Guinea', 'Guinea-Bissau',
'Liberia', 'Sierra Leone', 'Burkina Faso', 'Central African Republic',
'Republic of the Congo', 'Gabon', 'Equatorial Guinea', 'Zambia',
'Malawi', 'Mozambique', 'Angola', 'Burundi', 'South Africa',
'South Sudan', 'Somaliland', 'Uganda', 'Rwanda', 'Zimbabwe', 'Tanzania',
'Botswana', 'Namibia', 'Senegal', 'Mali', 'Mauritania', 'Benin']:
ax.add_geometries(country.geometry, ccrs.Miller(),
label=country.attributes['NAME_LONG'], color = 'black')
plt.savefig('Africa.png')
plt.close()
rng = np.random.RandomState(123)
fig = plt.figure(figsize=(20,15))
N_NorthAmerica = 10000
data_NorthAmerica = imread('NorthAmerica.png')[::-1, :, 0].T
X_NorthAmerica = rng.rand(4 * N_NorthAmerica, 2)
i, j = (X_NorthAmerica * data_NorthAmerica.shape).astype(int).T
X_NorthAmerica = X_NorthAmerica[data_NorthAmerica[i, j] < 1]
X_NorthAmerica = X_NorthAmerica[X_NorthAmerica[:, 1]<0.67]
y_NorthAmerica = np.array(['brown']*X_NorthAmerica.shape[0])
plt.scatter(X_NorthAmerica[:, 0], X_NorthAmerica[:, 1], c = 'brown', s = 50)
N_SouthAmerica = 10000
data_SouthAmerica = imread('SouthAmerica.png')[::-1, :, 0].T
X_SouthAmerica = rng.rand(4 * N_SouthAmerica, 2)
i, j = (X_SouthAmerica * data_SouthAmerica.shape).astype(int).T
X_SouthAmerica = X_SouthAmerica[data_SouthAmerica[i, j] < 1]
y_SouthAmerica = np.array(['red']*X_SouthAmerica.shape[0])
plt.scatter(X_SouthAmerica[:, 0], X_SouthAmerica[:, 1], c = 'red', s = 50)
N_Australia = 10000
data_Australia = imread('Australia.png')[::-1, :, 0].T
X_Australia = rng.rand(4 * N_Australia, 2)
i, j = (X_Australia * data_Australia.shape).astype(int).T
X_Australia = X_Australia[data_Australia[i, j] < 1]
y_Australia = np.array(['darkorange']*X_Australia.shape[0])
plt.scatter(X_Australia[:, 0], X_Australia[:, 1], c = 'darkorange', s = 50)
N_Eurasia = 10000
data_Eurasia = imread('Eurasia.png')[::-1, :, 0].T
X_Eurasia = rng.rand(4 * N_Eurasia, 2)
i, j = (X_Eurasia * data_Eurasia.shape).astype(int).T
X_Eurasia = X_Eurasia[data_Eurasia[i, j] < 1]
X_Eurasia = X_Eurasia[X_Eurasia[:, 0]>0.5]
X_Eurasia = X_Eurasia[X_Eurasia[:, 1]<0.67]
y_Eurasia = np.array(['blue']*X_Eurasia.shape[0])
plt.scatter(X_Eurasia[:, 0], X_Eurasia[:, 1], c = 'blue', s = 50)
N_Africa = 10000
data_Africa = imread('Africa.png')[::-1, :, 0].T
X_Africa = rng.rand(4 * N_Africa, 2)
i, j = (X_Africa * data_Africa.shape).astype(int).T
X_Africa = X_Africa[data_Africa[i, j] < 1]
y_Africa = np.array(['darkgreen']*X_Africa.shape[0])
plt.scatter(X_Africa[:, 0], X_Africa[:, 1], c = 'darkgreen', s = 50)
NorthAmerica_bb_coords = [np.min(X_NorthAmerica[:,0]), np.max(X_NorthAmerica[:,0]),
np.min(X_NorthAmerica[:,1]), np.max(X_NorthAmerica[:,1])]
Eurasia_bb_coords = [np.min(X_Eurasia[:,0]), np.max(X_Eurasia[:,0]),
np.min(X_Eurasia[:,1]), np.max(X_Eurasia[:,1])]
Africa_bb_coords = [np.min(X_Africa[:,0]), np.max(X_Africa[:,0]),
np.min(X_Africa[:,1]), np.max(X_Africa[:,1])]
SouthAmerica_bb_coords = [np.min(X_SouthAmerica[:,0]), np.max(X_SouthAmerica[:,0]),
np.min(X_SouthAmerica[:,1]), np.max(X_SouthAmerica[:,1])]
Australia_bb_coords = [np.min(X_Australia[:,0]), np.max(X_Australia[:,0]),
np.min(X_Australia[:,1]), np.max(X_Australia[:,1])]
ax = fig.add_subplot(1, 1, 1)
rect1 = plt.Rectangle((NorthAmerica_bb_coords[0], NorthAmerica_bb_coords[2]),
NorthAmerica_bb_coords[1] - NorthAmerica_bb_coords[0],
NorthAmerica_bb_coords[3] - NorthAmerica_bb_coords[2],
fill = False, ec = 'brown')
rect2 = plt.Rectangle((Eurasia_bb_coords[0], Eurasia_bb_coords[2]),
Eurasia_bb_coords[1] - Eurasia_bb_coords[0],
Eurasia_bb_coords[3] - Eurasia_bb_coords[2],
fill = False, ec = 'blue')
rect3 = plt.Rectangle((Africa_bb_coords[0], Africa_bb_coords[2]),
Africa_bb_coords[1] - Africa_bb_coords[0],
Africa_bb_coords[3] - Africa_bb_coords[2],
fill = False, ec = 'darkgreen')
rect4 = plt.Rectangle((SouthAmerica_bb_coords[0], SouthAmerica_bb_coords[2]),
SouthAmerica_bb_coords[1] - SouthAmerica_bb_coords[0],
SouthAmerica_bb_coords[3] - SouthAmerica_bb_coords[2],
fill = False, ec = 'red')
rect5 = plt.Rectangle((Australia_bb_coords[0], Australia_bb_coords[2]),
Australia_bb_coords[1] - Australia_bb_coords[0],
Australia_bb_coords[3] - Australia_bb_coords[2],
fill = False, ec = 'darkorange')
ax.add_patch(rect1)
ax.add_patch(rect2)
ax.add_patch(rect3)
ax.add_patch(rect4)
ax.add_patch(rect5)
X = np.vstack((X_NorthAmerica, X_SouthAmerica, X_Australia, X_Eurasia, X_Africa))
y = np.concatenate((y_NorthAmerica, y_SouthAmerica, y_Australia, y_Eurasia, y_Africa))
print(X.shape)
print(y.shape)
plt.show()
Let us check how correlated the pairwise distances between data points are after applying PCA, UMAP and tSNE to the synthetic World Map data set. For this purpose we will use the Spearman correlation between the original distances and the distances between data points after dimension reduction. We are going to apply bootstrapping, i.e. repeatedly drawing random 90% subsamples of the data, in order to build confidence intervals for a more robust comparison of the dimension reduction algorithms.
import warnings
warnings.filterwarnings("ignore")
import random
import numpy as np
from umap import UMAP
from sklearn.manifold import TSNE
from scipy.stats import spearmanr, pearsonr
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances
coef_pca_list = []; coef_tsne_list = []; coef_umap_list = []
for i in range(10):
    print('Working with iteration {}'.format(i + 1))
    X_boot = X[random.sample(range(X.shape[0]), int(round(X.shape[0] * 0.9, 0))), :]
    X_reduced = PCA(n_components = 2).fit_transform(X_boot)
    model_tsne = TSNE(learning_rate = 200, n_components = 2, perplexity = 500,
                      init = X_reduced, n_iter = 1000, verbose = 0)
    tsne = model_tsne.fit_transform(X_boot)
    model_umap = UMAP(learning_rate = 1, n_components = 2, min_dist = 2, n_neighbors = 500,
                      init = X_reduced, n_epochs = 1000, verbose = 0, spread = 2)
    umap = model_umap.fit_transform(X_boot)
    dist_orig = np.square(euclidean_distances(X_boot, X_boot)).flatten()
    dist_pca = np.square(euclidean_distances(X_reduced, X_reduced)).flatten()
    dist_tsne = np.square(euclidean_distances(tsne, tsne)).flatten()
    dist_umap = np.square(euclidean_distances(umap, umap)).flatten()
    coef_pca, p_pca = spearmanr(dist_orig, dist_pca)
    coef_tsne, p_tsne = spearmanr(dist_orig, dist_tsne)
    coef_umap, p_umap = spearmanr(dist_orig, dist_umap)
    coef_pca_list.append(coef_pca); coef_tsne_list.append(coef_tsne); coef_umap_list.append(coef_umap)
    print('Spearman correlation coefficient for PCA dimension reduction = {}'.format(coef_pca))
    print('Spearman correlation coefficient for tSNE dimension reduction = {}'.format(coef_tsne))
    print('Spearman correlation coefficient for UMAP dimension reduction = {}'.format(coef_umap))
    print('****************************************************************************')
import matplotlib
plt.figure(figsize = (20, 15))
matplotlib.rcParams.update({'font.size': 22})
plt.boxplot([coef_pca_list, coef_umap_list, coef_tsne_list], labels = ['PCA', 'UMAP', 'tSNE'], patch_artist = True)
plt.ylabel('Spearman Correlation Coefficient', fontsize = 22)
plt.title('Correlation of original with reconstructed distances between data points', fontsize = 22)
plt.show()
from scipy.stats import mannwhitneyu
stat, p = mannwhitneyu(coef_umap_list, coef_tsne_list)
print('Statistics = %.3f, p = %.3f' % (stat, p))
We conclude that PCA perfectly reconstructs the original 2D World's Map data, which is not surprising because the World's Map is, so far, a linear manifold; accordingly, the Spearman correlation coefficient between the original and reconstructed distances is close to one. Both UMAP and tSNE preserve the majority of pairwise distances, with Spearman correlation coefficients above 0.9. However, as we can see from both the boxplot and the Mann-Whitney U test, the Spearman correlation coefficient for UMAP is significantly higher than the one for tSNE, implying that UMAP is superior in global structure preservation on a linear manifold, even when the initialization is fixed to PCA for both UMAP and tSNE, which removes this layer of uncertainty from the comparison. Computing correlations between all pairs of points within and between clusters, we can see that UMAP better preserves both local and global structure.
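As a side note, the Mann-Whitney U test used here is a non-parametric test for whether two samples come from the same distribution. A minimal sketch with made-up bootstrap coefficients (the numbers below are illustrative, not the actual UMAP / tSNE values):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.RandomState(0)
# Hypothetical bootstrap Spearman coefficients for two methods:
# a 0.05 shift between the groups dwarfs the 0.01 within-group spread
coef_a = 0.95 + 0.01 * rng.randn(10)
coef_b = 0.90 + 0.01 * rng.randn(10)

stat, p = mannwhitneyu(coef_a, coef_b)
print('Statistics = %.3f, p = %.5f' % (stat, p))  # tiny p-value: a clear separation
```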
Let us now check how well the dimension reduction techniques preserve the distances between the centroids of the clusters. This time we are not going to apply bootstrapping but will use the whole data set, in order to demonstrate an interesting difference between tSNE and UMAP: the absence of stochasticity for tSNE (given a fixed initialization) and its presence for UMAP.
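The source of this difference can be illustrated without tSNE or UMAP at all. Below is a toy 1D optimization (a hypothetical sketch, just to show the mechanics): full-batch gradient descent, as used by tSNE, is a deterministic function of its initialization, while stochastic gradient descent, as used by UMAP, consumes random numbers at every step:

```python
import numpy as np

def full_batch_gd(X, steps=100, lr=0.1):
    # Deterministic: uses all points at every step (like tSNE's gradient descent)
    w = 0.0
    for _ in range(steps):
        w -= lr * np.mean(2 * (w - X))
    return w

def sgd(X, seed, steps=100, lr=0.1):
    # Stochastic: one randomly drawn point per step (like UMAP's SGD)
    rng = np.random.RandomState(seed)
    w = 0.0
    for _ in range(steps):
        x = X[rng.randint(len(X))]
        w -= lr * 2 * (w - x)
    return w

X_toy = np.random.RandomState(0).randn(1000)
print(full_batch_gd(X_toy) == full_batch_gd(X_toy))  # identical from run to run
print(sgd(X_toy, seed=1) == sgd(X_toy, seed=2))      # depends on the RNG state
```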
import warnings
warnings.filterwarnings("ignore")
import random
from umap import UMAP
from sklearn.manifold import TSNE
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances
coef_pca_centroids_list = []; coef_tsne_centroids_list = []; coef_umap_centroids_list = []
for i in range(10):
    print('Working with iteration {}'.format(i + 1))
    #X_boot = X[random.sample(range(X.shape[0]), int(round(X.shape[0] * 0.8, 0))), :]
    X = np.vstack((X_NorthAmerica, X_SouthAmerica, X_Australia, X_Eurasia, X_Africa))
    X_centroids = np.vstack((np.mean(X_NorthAmerica, axis = 0),
                             np.mean(X_SouthAmerica, axis = 0),
                             np.mean(X_Australia, axis = 0),
                             np.mean(X_Eurasia, axis = 0),
                             np.mean(X_Africa, axis = 0)))
    X_reduced = PCA(n_components = 2).fit_transform(X)
    X_pca_centroids = np.vstack((np.mean(X_reduced[y == 'brown'], axis = 0),
                                 np.mean(X_reduced[y == 'red'], axis = 0),
                                 np.mean(X_reduced[y == 'darkorange'], axis = 0),
                                 np.mean(X_reduced[y == 'blue'], axis = 0),
                                 np.mean(X_reduced[y == 'darkgreen'], axis = 0)))
    model_tsne = TSNE(learning_rate = 200, n_components = 2, perplexity = 500,
                      init = X_reduced, n_iter = 1000, verbose = 0)
    tsne = model_tsne.fit_transform(X)
    X_tsne_centroids = np.vstack((np.mean(tsne[y == 'brown'], axis = 0),
                                  np.mean(tsne[y == 'red'], axis = 0),
                                  np.mean(tsne[y == 'darkorange'], axis = 0),
                                  np.mean(tsne[y == 'blue'], axis = 0),
                                  np.mean(tsne[y == 'darkgreen'], axis = 0)))
    model_umap = UMAP(learning_rate = 1, n_components = 2, min_dist = 2, n_neighbors = 500,
                      init = X_reduced, n_epochs = 1000, verbose = 0, spread = 2)
    umap = model_umap.fit_transform(X)
    X_umap_centroids = np.vstack((np.mean(umap[y == 'brown'], axis = 0),
                                  np.mean(umap[y == 'red'], axis = 0),
                                  np.mean(umap[y == 'darkorange'], axis = 0),
                                  np.mean(umap[y == 'blue'], axis = 0),
                                  np.mean(umap[y == 'darkgreen'], axis = 0)))
    #from sklearn.metrics.pairwise import pairwise_distances
    #np.square(pairwise_distances(X_centroids, X_centroids, metric = 'mahalanobis'))
    dist_centroids_orig = np.square(euclidean_distances(X_centroids, X_centroids)).flatten()
    dist_centroids_pca = np.square(euclidean_distances(X_pca_centroids, X_pca_centroids)).flatten()
    dist_centroids_tsne = np.square(euclidean_distances(X_tsne_centroids, X_tsne_centroids)).flatten()
    dist_centroids_umap = np.square(euclidean_distances(X_umap_centroids, X_umap_centroids)).flatten()
    coef_centroids_pca, p_centroids_pca = spearmanr(dist_centroids_orig, dist_centroids_pca)
    coef_centroids_tsne, p_centroids_tsne = spearmanr(dist_centroids_orig, dist_centroids_tsne)
    coef_centroids_umap, p_centroids_umap = spearmanr(dist_centroids_orig, dist_centroids_umap)
    coef_pca_centroids_list.append(coef_centroids_pca); coef_tsne_centroids_list.append(coef_centroids_tsne)
    coef_umap_centroids_list.append(coef_centroids_umap)
    print('Spearman correlation coefficient for PCA dimension reduction = {}'.format(coef_centroids_pca))
    print('Spearman correlation coefficient for tSNE dimension reduction = {}'.format(coef_centroids_tsne))
    print('Spearman correlation coefficient for UMAP dimension reduction = {}'.format(coef_centroids_umap))
    print('****************************************************************************')
import matplotlib
plt.figure(figsize = (20, 15))
matplotlib.rcParams.update({'font.size': 22})
plt.boxplot([coef_pca_centroids_list, coef_umap_centroids_list, coef_tsne_centroids_list],
labels = ['PCA', 'UMAP', 'tSNE'], patch_artist = True)
plt.title('Preservation of distances between centroids of clusters by dimension reduction algorithms',
fontsize = 22)
plt.ylabel('Spearman Correlation coefficient', fontsize = 24)
plt.show()
We conclude that we do not observe a significant difference in the preservation of distances between cluster centroids by UMAP and tSNE; PCA, of course, perfectly preserved the distances between centroids. Here we observe a very interesting lack of stochasticity for tSNE: the result becomes deterministic (the same from run to run) when the initialization is not random. In contrast, for UMAP the initialization is not the only source of stochasticity; it also comes from the stochastic gradient descent. Here we see that even with a non-random initialization, the result of UMAP still varies from run to run because of the stochastic gradient descent.
Measuring distances between centroids, i.e. ignoring the variation in the data, is perhaps not a very good idea since the clusters are elongated. Therefore a better idea is to compute Mahalanobis distances between all pairs of clusters. The Mahalanobis distance first calculates the distances between each point of one cluster and the centroid of the second cluster, and then normalizes those distances by the "thickness" of the variation in the second cluster (assuming the clusters have ellipsoidal symmetry).
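To make the definition concrete, here is a minimal sketch (a toy elongated cluster, not the map data) of the Mahalanobis distance computed from the inverse covariance matrix: two points at the same Euclidean distance from the centroid get very different Mahalanobis distances depending on whether they lie along or across the stretched direction:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.RandomState(0)
# An elongated cluster: large variance along x, small along y
cluster = rng.randn(500, 2) * np.array([5.0, 0.5])
VI = np.linalg.inv(np.cov(cluster.T))  # inverse covariance = the "thickness" normalizer

centroid = cluster.mean(axis=0)
p_along = centroid + np.array([5.0, 0.0])   # 5 units away along the long axis
p_across = centroid + np.array([0.0, 5.0])  # 5 units away across the short axis

d_along = mahalanobis(p_along, centroid, VI)
d_across = mahalanobis(p_across, centroid, VI)
# Same Euclidean distance, but p_across is much "further" in Mahalanobis terms
print(d_along, d_across)
```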
import matplotlib
from scipy.stats import spearmanr
from sklearn.metrics.pairwise import pairwise_distances
figure = plt.figure(figsize = (20, 15))
matplotlib.rcParams.update({'font.size': 22})
clusters = list(set(y))
clust_dict = {'brown': 'North America', 'red': 'South America', 'blue': 'Eurasia',
'darkgreen': 'Africa', 'darkorange': 'Australia'}
coef_tsne_all = []; coef_umap_all = []
for clust_index, j in enumerate(clusters):
    coef_tsne = []; coef_umap = []
    for i in clusters:
        if(i != j):
            orig_dist = np.square(pairwise_distances(X[y == j], X[y == i], metric = 'mahalanobis')).flatten()
            tsne_dist = np.square(pairwise_distances(tsne[y == j],
                                                     tsne[y == i], metric = 'mahalanobis')).flatten()
            umap_dist = np.square(pairwise_distances(umap[y == j],
                                                     umap[y == i], metric = 'mahalanobis')).flatten()
            coef_tsne_current, _ = spearmanr(orig_dist, tsne_dist)
            coef_umap_current, _ = spearmanr(orig_dist, umap_dist)
            coef_tsne.append(np.abs(coef_tsne_current))
            coef_umap.append(np.abs(coef_umap_current))
    plt.subplot(321 + clust_index)
    plt.boxplot([coef_tsne, coef_umap], labels = ['tSNE', 'UMAP'], patch_artist = True)
    plt.title('Distances Between {} and Others'.format(clust_dict[j]), fontsize = 22)
    plt.ylabel('Spearman Rho', fontsize = 24)
    coef_tsne_all.append(coef_tsne)
    coef_umap_all.append(coef_umap)
coef_tsne_all = [y for x in coef_tsne_all for y in x]
coef_umap_all = [y for x in coef_umap_all for y in x]
plt.subplot(326)
plt.boxplot([coef_tsne_all, coef_umap_all], labels = ['tSNE', 'UMAP'], patch_artist = True)
plt.title('Distances Between All Pairs of Clusters', fontsize = 22)
plt.ylabel('Spearman Rho', fontsize = 24)
figure.tight_layout()
plt.show()
from scipy.stats import mannwhitneyu
stat, p = mannwhitneyu(coef_tsne_all, coef_umap_all)
print('Statistics = %.3f, p = %.3f' % (stat, p))
Here we observe that, while the picture is somewhat unclear for South America, Australia and Africa, at least for North America and Eurasia UMAP preserves the original Mahalanobis distances between those clusters and the other ones much better than tSNE. Pooling across all clusters (the last panel in the figure above) and performing a Mann-Whitney U test, we demonstrate that UMAP indeed preserves the original Mahalanobis distances between the continents / clusters significantly better.
Let us now estimate how well the shapes of the clusters are preserved by the different dimension reduction methods. By the shape of a cluster we understand the sizes (height and width) of the bounding box wrapped around the cluster. The scale of the bounding boxes changes a lot during tSNE and UMAP dimension reduction, i.e. a box can be stretched or squeezed; however, both width and height should change proportionally to each other (keeping their ratio) and to the initial dimensions of the bounding box. Therefore we are going to use the Spearman correlation coefficient between the original and reconstructed sizes of the bounding boxes as a criterion of how well the dimension reduction algorithms preserve the shapes of the clusters.
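The logic of this criterion can be sketched on toy clusters (hypothetical data, just to illustrate the idea): a uniform rescaling of the clusters, which any embedding is allowed to do, changes all bounding-box sizes by the same factor and therefore leaves their ranks, and hence the Spearman coefficient, essentially untouched:

```python
import numpy as np
from scipy.stats import spearmanr

def bb_sizes(points):
    # Width and height of the axis-aligned bounding box around a cluster
    return points.max(axis=0) - points.min(axis=0)

rng = np.random.RandomState(0)
# Three toy clusters with different widths and heights
clusters = [rng.randn(200, 2) * scale for scale in ([1, 3], [4, 1], [2, 2])]

orig_sizes = np.concatenate([bb_sizes(c) for c in clusters])
# A uniform stretch by 2.5 mimics what a "good" embedding may do to the scale
scaled_sizes = np.concatenate([bb_sizes(2.5 * c) for c in clusters])

rho, _ = spearmanr(orig_sizes, scaled_sizes)
print(rho)  # effectively 1: the ranks of all box sizes are preserved
```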
import warnings
warnings.filterwarnings("ignore")
import random
from umap import UMAP
from sklearn.manifold import TSNE
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances

def bb_sizes(points):
    # Width and height of the axis-aligned bounding box around a cluster
    return np.array([np.max(points[:, 0]) - np.min(points[:, 0]),
                     np.max(points[:, 1]) - np.min(points[:, 1])])

# Clusters in the original space and the corresponding labels in y
orig_clusters = [X_NorthAmerica, X_SouthAmerica, X_Australia, X_Eurasia, X_Africa]
labels = ['brown', 'red', 'darkorange', 'blue', 'darkgreen']
coef_sizes_pca_list = []; coef_sizes_tsne_list = []; coef_sizes_umap_list = []
for i in range(10):
    print('Working with iteration {}'.format(i + 1))
    orig_bb_sizes = np.vstack([bb_sizes(clust) for clust in orig_clusters]).flatten()
    X_reduced = PCA(n_components = 2).fit_transform(X)
    pca_bb_sizes = np.vstack([bb_sizes(X_reduced[y == lab]) for lab in labels]).flatten()
    model_tsne = TSNE(learning_rate = 200, n_components = 2, perplexity = 500,
                      init = X_reduced, n_iter = 1000, verbose = 0)
    tsne = model_tsne.fit_transform(X)
    tsne_bb_sizes = np.vstack([bb_sizes(tsne[y == lab]) for lab in labels]).flatten()
    model_umap = UMAP(learning_rate = 1, n_components = 2, min_dist = 2, n_neighbors = 500,
                      init = X_reduced, n_epochs = 1000, verbose = 0, spread = 2)
    umap = model_umap.fit_transform(X)
    umap_bb_sizes = np.vstack([bb_sizes(umap[y == lab]) for lab in labels]).flatten()
    coef_sizes_pca, p_sizes_pca = spearmanr(orig_bb_sizes, pca_bb_sizes)
    coef_sizes_tsne, p_sizes_tsne = spearmanr(orig_bb_sizes, tsne_bb_sizes)
    coef_sizes_umap, p_sizes_umap = spearmanr(orig_bb_sizes, umap_bb_sizes)
    coef_sizes_pca_list.append(coef_sizes_pca)
    coef_sizes_tsne_list.append(coef_sizes_tsne)
    coef_sizes_umap_list.append(coef_sizes_umap)
    figure = plt.figure(figsize = (20, 6))
    plt.subplot(131)
    plt.scatter(orig_bb_sizes, pca_bb_sizes)
    plt.xlabel('Original Sizes'); plt.ylabel('Reconstructed Sizes'); plt.title('PCA')
    plt.subplot(132)
    plt.scatter(orig_bb_sizes, tsne_bb_sizes)
    plt.xlabel('Original Sizes'); plt.ylabel('Reconstructed Sizes'); plt.title('tSNE')
    plt.subplot(133)
    plt.scatter(orig_bb_sizes, umap_bb_sizes)
    plt.xlabel('Original Sizes'); plt.ylabel('Reconstructed Sizes'); plt.title('UMAP')
    figure.tight_layout()
    plt.show()
    print('Spearman correlation coefficient for PCA dimension reduction = {}'.format(coef_sizes_pca))
    print('Spearman correlation coefficient for tSNE dimension reduction = {}'.format(coef_sizes_tsne))
    print('Spearman correlation coefficient for UMAP dimension reduction = {}'.format(coef_sizes_umap))
    print('****************************************************************************')
import matplotlib
plt.figure(figsize = (20, 15))
matplotlib.rcParams.update({'font.size': 22})
plt.boxplot([coef_sizes_pca_list, coef_sizes_umap_list, coef_sizes_tsne_list],
labels = ['PCA', 'UMAP', 'tSNE'], patch_artist = True)
plt.title('Preserving shapes of clusters by dimension reduction algorithms', fontsize = 22)
plt.ylabel('Spearman Correlation coefficient', fontsize = 24)
plt.show()
from scipy.stats import mannwhitneyu
stat, p = mannwhitneyu(coef_sizes_umap_list, coef_sizes_tsne_list)
print('Statistics = %.3f, p = %.3f' % (stat, p))
PCA, of course, demonstrates a perfect correlation between the original and reconstructed sizes of the bounding boxes around the clusters. Regarding tSNE vs. UMAP, despite the large variation in the UMAP Spearman correlation coefficient between original and reconstructed cluster sizes, we conclude that UMAP preserves the shapes / sizes of the clusters significantly better, which is another confirmation of better global structure preservation by UMAP compared to tSNE.
First let us construct a synthetic data set. We will use the fantastic In-Depth Manifold Learning tutorial and start with a word (a 2D collection of points) projected onto a linear manifold without added noise; later we are going to project the word onto a non-linear manifold such as an S-curve or a swiss roll.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread
N = 10000
fig, ax = plt.subplots(figsize=(10, 1))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1)
ax.axis('off')
ax.text(0.5, 0.4, 'H E L L O', va = 'center', ha = 'center', weight = 'bold', size = 85)
fig.savefig('hello.png')
plt.close(fig)
data = imread('hello.png')[::-1, :, 0].T
rng = np.random.RandomState(123)
X = rng.rand(4 * N, 2)
i, j = (X * data.shape).astype(int).T
mask = (data[i, j] < 1)
X = X[mask]
X[:, 0] *= (data.shape[0] / data.shape[1])
X = X[:N]
X = X[np.argsort(X[:, 0])]
plt.figure(figsize=(20,15))
plt.scatter(X[:, 0], X[:, 1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.axis('equal');
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
def make_hello(N=1000, rseed=42):
    # Make a plot with "HELLO" text; save as PNG
    fig, ax = plt.subplots(figsize=(4, 1))
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1)
    ax.axis('off')
    ax.text(0.5, 0.4, 'HELLO', va='center', ha='center', weight='bold', size=85)
    fig.savefig('hello.png')
    plt.close(fig)
    # Open this PNG and draw random points from it
    from matplotlib.image import imread
    data = imread('hello.png')[::-1, :, 0].T
    rng = np.random.RandomState(rseed)
    X = rng.rand(4 * N, 2)
    i, j = (X * data.shape).astype(int).T
    mask = (data[i, j] < 1)
    X = X[mask]
    X[:, 0] *= (data.shape[0] / data.shape[1])
    X = X[:N]
    return X[np.argsort(X[:, 0])]
X = make_hello(1000)
plt.figure(figsize=(20,15))
plt.scatter(X[:, 0], X[:, 1], c=X[:, 0], cmap=plt.cm.get_cmap('rainbow', 5), s = 50)
plt.axis('equal');
This is what the synthetic data looks like; it is just a 2D numpy array containing 1000 data points.
X
Now let us perform linear dimension reduction with Principal Component Analysis (PCA) and Multi-Dimensional Scaling (MDS) and see whether PCA and MDS will be able to reconstruct the initial data set via a linear matrix decomposition.
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 2).fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.title('Principal Component Analysis (PCA)', fontsize = 20)
plt.xlabel("PCA1", fontsize = 20)
plt.ylabel("PCA2", fontsize = 20)
plt.show()
from sklearn.manifold import MDS
model_mds = MDS(n_components = 2, random_state = 123, metric = True)
mmds = model_mds.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(mmds[:, 0], mmds[:, 1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.title('Metric Multi-Dimensional Scaling (MMDS)', fontsize = 20)
plt.xlabel("MMDS1", fontsize = 20)
plt.ylabel("MMDS2", fontsize = 20)
plt.show()
We conclude that PCA and MDS perfectly reconstructed the original data. This is not surprising because the data is 2D and lies on a linear manifold, i.e. only rotations, scaling and translations are involved. For comparison, we will check the Laplacian Eigenmaps dimension reduction plot.
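The "only rotations, scaling and translations" point can be verified directly: applying a rotation plus a translation to some data and then letting PCA re-align it leaves all pairwise distances intact, since these transformations are isometries (a minimal sketch on random 2D points, not the HELLO data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 2)

# Rotate by 60 degrees and translate
theta = np.pi / 3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X_rot = X_demo @ R.T + np.array([10.0, -5.0])

# PCA re-aligns the rotated cloud; pairwise distances survive exactly
X_back = PCA(n_components=2).fit_transform(X_rot)
preserved = np.allclose(euclidean_distances(X_demo, X_demo),
                        euclidean_distances(X_back, X_back))
print(preserved)
```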
from sklearn.manifold import SpectralEmbedding
model = SpectralEmbedding(n_components = 2)
se = model.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(se[:, 0], se[:, 1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.title('Laplacian Eigenmap', fontsize = 20)
plt.xlabel("LAP1", fontsize = 20)
plt.ylabel("LAP2", fontsize = 20)
plt.show()
The Laplacian Eigenmaps tend to group the points from each cluster / letter together almost into a single point, which explains why Spectral Clustering is such a powerful and popular technique. Since the Laplacian Eigenmaps basically produce very tightly packed clusters, it is very easy to run any clustering algorithm, even K-means, on top of the dimension reduction.
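This is, in essence, what Spectral Clustering does. A small sketch on the classic two-moons data (a stand-in example, not our HELLO set): K-means fails in the original coordinates but succeeds on the Laplacian Eigenmaps embedding, where each moon is contracted into a tight blob:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: not linearly separable
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-means directly on the raw coordinates cuts both moons in half
labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# K-means on the Laplacian Eigenmaps embedding recovers the true moons
se = SpectralEmbedding(n_components=2, n_neighbors=10,
                       random_state=0).fit_transform(X_moons)
labels_se = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(se)

ari_raw = adjusted_rand_score(y_moons, labels_raw)
ari_se = adjusted_rand_score(y_moons, labels_se)
print(ari_raw, ari_se)  # the embedding-based clustering scores much higher
```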
Now it is time to run a non-linear dimension reduction such as tSNE. A common rule of thumb, perplexity $\approx \sqrt{N} \approx 30$ for our $N = 1000$ points, should give a good balance between local and global structure. We will start with this moderate perplexity = 30 and later use a very large perplexity = 1000 in order to try to fully reconstruct the original data set.
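Perplexity itself has a simple interpretation as the "effective number of neighbors": it is $2^H$, where $H$ is the Shannon entropy of the neighbor probabilities. A quick standalone sanity check, mirroring the definition used in the from-scratch code further below:

```python
import numpy as np

def perplexity(p):
    # Perplexity = 2^(Shannon entropy): roughly the effective number of neighbors
    p = p[p > 0]
    return 2 ** (-np.sum(p * np.log2(p)))

perp_uniform = perplexity(np.ones(30) / 30)   # uniform over 30 neighbors -> 30
perp_single = perplexity(np.array([1.0]))     # all mass on one neighbor -> 1
print(perp_uniform, perp_single)
```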
from sklearn.manifold import TSNE
X_reduced = PCA(n_components = 2).fit_transform(X)
model = TSNE(learning_rate = 1, n_components = 2, random_state = 123, perplexity = 30,
init = X_reduced, n_iter = 10000, verbose = 2, early_exaggeration = 1)
tsne = model.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(tsne[:, 0], tsne[:, 1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.title('tSNE', fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20)
plt.ylabel("tSNE2", fontsize = 20)
plt.show()
Looks like tSNE almost reconstructed the original data. Let us check how UMAP performs on this data set:
import warnings
warnings.filterwarnings("ignore")
from umap import UMAP
X_reduced = PCA(n_components = 2).fit_transform(X)
model = UMAP(learning_rate = 1, n_components = 2, min_dist = 1, n_neighbors = 998,
init = X_reduced, n_epochs = 10000, verbose = 2)
umap = model.fit_transform(X)
plt.figure(figsize=(20,15))
plt.scatter(umap[:, 0], umap[:, 1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.title('UMAP', fontsize = 20)
plt.xlabel("UMAP1", fontsize = 20)
plt.ylabel("UMAP2", fontsize = 20)
plt.show()
We conclude that both tSNE and UMAP can reconstruct the original data. To achieve this, tSNE needs a large learning rate and perplexity close to the size of the data set, while UMAP needs a small learning rate, min_dist = 1 and large (almost infinite) n_neighbors.
Here we will implement tSNE from scratch and plot its cost function, the KL-divergence, together with its derivative (gradient); the UMAP cost function, the Cross-Entropy (CE), will be treated analogously.
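Before the implementation, it is worth seeing with two numbers why the KL-divergence is blamed for tSNE's loss of global structure: the per-pair cost $P \log(P/Q)$ is large when close neighbors (large $P$) are placed far apart (small $Q$), but nearly zero in the opposite case, so tSNE is barely penalized for collapsing large high-dimensional distances (illustrative probabilities, not computed from data):

```python
import numpy as np

def kl_term(P, Q):
    # Contribution of a single pair of points to the KL-divergence
    return P * np.log(P / Q)

close_placed_far = kl_term(P=0.8, Q=0.01)  # neighbors torn apart: big penalty
far_placed_close = kl_term(P=0.01, Q=0.8)  # distant points collapsed: tiny penalty
print(close_placed_far, far_placed_close)
```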
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
N_LOW_DIMS = 2
MAX_ITER = 1000
PERPLEXITY = 200
LEARNING_RATE = 0.6
X_train = X; n = X_train.shape[0]
y_train = X[:, 0]
dist = np.square(euclidean_distances(X_train, X_train))
plt.figure(figsize=(20,15))
sns.distplot(dist.reshape(-1,1))
plt.title("EUCLIDEAN DISTANCES", fontsize = 20)
plt.xlabel("EUCLIDEAN DISTANCE", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
def prob_high_dim(sigma, dist_row):
    exp_distance = np.exp(-dist[dist_row] / (2 * sigma**2))
    exp_distance[dist_row] = 0
    prob_not_symmetr = exp_distance / np.sum(exp_distance)
    return prob_not_symmetr
def perplexity(prob):
    return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))
    #return -np.sum([p*np.log2(p) for p in prob if p!=0])
def sigma_binary_search(perp_of_sigma, fixed_perplexity):
    sigma_lower_limit = 0; sigma_upper_limit = 1000
    for i in range(20):
        approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
        if perp_of_sigma(approx_sigma) < fixed_perplexity:
            sigma_lower_limit = approx_sigma
        else:
            sigma_upper_limit = approx_sigma
        if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
            break
    return approx_sigma
prob = np.zeros((n,n)); sigma_array = []
for dist_row in range(n):
    func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
    binary_search_result = sigma_binary_search(func, PERPLEXITY)
    prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
    sigma_array.append(binary_search_result)
    if (dist_row + 1) % 100 == 0:
        print("Sigma binary search finished {0} of {1} cells".format(dist_row + 1, n))
print("\nMean sigma = " + str(np.mean(sigma_array)))
plt.figure(figsize=(20,15))
sns.distplot(prob.reshape(-1,1))
plt.title("HIGH-DIMENSIONAL PROBABILITIES", fontsize = 20)
plt.xlabel("PROBABILITY", fontsize = 20); plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
sns.distplot(sigma_array)
plt.title("Histogram of Sigma values", fontsize = 20)
plt.xlabel("SIGMA", fontsize = 20)
plt.ylabel("FREQUENCY", fontsize = 20)
plt.show()
P = prob + np.transpose(prob)
def prob_low_dim(Y):
    inv_distances = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
    np.fill_diagonal(inv_distances, 0.)
    return inv_distances / np.sum(inv_distances, axis = 1, keepdims = True)
def KL(P, Y):
    Q = prob_low_dim(Y)
    return P * np.log(P + 0.01) - P * np.log(Q + 0.01)
def KL_gradient(P, Y):
    Q = prob_low_dim(Y)
    y_diff = np.expand_dims(Y, 1) - np.expand_dims(Y, 0)
    inv_dist = np.power(1 + np.square(euclidean_distances(Y, Y)), -1)
    return 4 * np.sum(np.expand_dims(P - Q, 2) * y_diff * np.expand_dims(inv_dist, 2), axis = 1)
np.random.seed(12345)
#y = np.random.normal(loc = 0, scale = 1, size = (n, N_LOW_DIMS))
y = X_reduced
KL_array = []; KL_gradient_array = []
print("Running Gradient Descent: \n")
for i in range(MAX_ITER):
    y = y - LEARNING_RATE * KL_gradient(P, y)
    KL_array.append(np.sum(KL(P, y)))
    KL_gradient_array.append(np.sum(KL_gradient(P, y)))
    if i % 100 == 0:
        print("KL divergence = " + str(np.sum(KL(P, y))))
plt.figure(figsize=(20,15))
plt.plot(KL_array,'-o')
plt.title("KL-divergence", fontsize = 20)
plt.xlabel("ITERATION", fontsize = 20); plt.ylabel("KL-DIVERGENCE", fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
plt.plot(KL_gradient_array,'-o')
plt.title("KL-divergence Gradient", fontsize = 20)
plt.xlabel("ITERATION", fontsize = 20); plt.ylabel("KL-DIVERGENCE GRADIENT", fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
sns.distplot(-np.sum(prob*np.log2(prob + 0.00001), axis=0))
#sns.distplot(np.power(2, -np.sum(prob*np.log2(prob+0.00001),axis=0)))
plt.show()
plt.figure(figsize=(20,15))
plt.scatter(y[:,0], y[:,1], c = X[:, 0], cmap = plt.cm.get_cmap('rainbow', 5), s = 50)
plt.title("tSNE on a synthetic data set", fontsize = 20)
plt.xlabel("tSNE1", fontsize = 20); plt.ylabel("tSNE2", fontsize = 20)
plt.show()
Let us figure out how sigma is connected with perplexity for tSNE:
import numpy as np
import pandas as pd
from scipy import optimize
import matplotlib.pyplot as plt
from sklearn.manifold import SpectralEmbedding
from sklearn.metrics.pairwise import euclidean_distances
path = '/home/nikolay/WABI/K_Pietras/Manifold_Learning/'
expr = pd.read_csv(path + 'bartoschek_filtered_expr_rpkm.txt', sep='\t')
print(expr.iloc[0:4,0:4])
X_train = expr.values[:,0:(expr.shape[1]-1)]
X_train = np.log(X_train + 1)
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the data set: ")
print(X_train.shape, y_train.shape)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
my_perp = []; my_sigma_tSNE = []
for PERPLEXITY in range(4, 724, 10):
    #X_train = X; n = X_train.shape[0]
    #y_train = X[:, 0]
    dist = np.square(euclidean_distances(X_train, X_train))
    def prob_high_dim(sigma, dist_row):
        exp_distance = np.exp(-dist[dist_row] / (2*sigma**2))
        exp_distance[dist_row] = 0
        prob_not_symmetr = exp_distance / np.sum(exp_distance)
        return prob_not_symmetr
    def perplexity(prob):
        return np.power(2, -np.sum([p*np.log2(p) for p in prob if p!=0]))
    def sigma_binary_search(perp_of_sigma, fixed_perplexity):
        sigma_lower_limit = 0; sigma_upper_limit = 1000
        for i in range(20):
            approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
            if perp_of_sigma(approx_sigma) < fixed_perplexity:
                sigma_lower_limit = approx_sigma
            else:
                sigma_upper_limit = approx_sigma
            if np.abs(fixed_perplexity - perp_of_sigma(approx_sigma)) <= 1e-5:
                break
        return approx_sigma
    prob = np.zeros((n,n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: perplexity(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, PERPLEXITY)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
    print("Perplexity = {0}, Mean Sigma = {1}".format(PERPLEXITY, np.mean(sigma_array)))
    my_perp.append(PERPLEXITY)
    my_sigma_tSNE.append(np.mean(sigma_array))
plt.figure(figsize=(20,15))
plt.plot(my_perp, my_sigma_tSNE, '-o')
plt.title("tSNE: Mean Sigma vs. Perplexity", fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20)
plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
plt.figure(figsize=(20,15))
# Distribution of the squared pairwise distances
sns.histplot(dist.flatten(), kde=True)
plt.show()
plt.figure(figsize=(20,15))
# Distribution of the high-dimensional probabilities
sns.histplot(prob.flatten(), kde=True)
plt.show()
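As a quick sanity check of the perplexity definition used above (a sketch, not part of the original analysis): for a uniform distribution over k neighbors, 2 to the power of the Shannon entropy evaluates exactly to k, which is why perplexity is often read as an effective number of nearest neighbors:

```python
import numpy as np

# Perplexity = 2**H(p), where H is the Shannon entropy of a probability row.
# For a uniform distribution over k neighbors it evaluates exactly to k.
k = 8
p = np.full(k, 1.0 / k)
perp = np.power(2, -np.sum(p * np.log2(p)))
print(perp)  # 8.0
```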
from scipy import optimize
import matplotlib.pyplot as plt
import numpy as np
N = 1000
perp = np.array([5,10,20,50,100,200,300,400,500,600,700,800,900,920,950,980,990,995,998])
sigma_exact = np.array([0.032,0.05,0.078,0.15,0.24,0.38,0.52,0.68,0.88,1.1,1.35,1.65,2.15,
2.31,2.68,3.5,4.26,5.26,7.5])
sigma = lambda perp, a, b, c: ((a*perp) / N) / (1 - c*((perp) / N)**b)
p , _ = optimize.curve_fit(sigma, perp, sigma_exact)
print(p)
plt.figure(figsize=(20,15))
plt.plot(perp, sigma_exact, "o")
plt.plot(perp, sigma(perp, p[0], p[1], p[2]), c = "red")
plt.title("Non-Linear Least Square Fit", fontsize = 20)
plt.gca().legend(('Original', 'Fit'), fontsize = 20)
plt.xlabel("PERPLEXITY", fontsize = 20); plt.ylabel("SIGMA", fontsize = 20)
plt.show()
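The non-linear least-squares fit above can be sanity-checked on synthetic data drawn from the same functional form. This sketch (with made-up parameter values a = 2, b = 2, c = 0.9) confirms that curve_fit recovers a curve through noise-free samples of the model:

```python
import numpy as np
from scipy import optimize

# Synthetic sanity check with hypothetical parameters: generate noise-free
# sigma values from the model used above and refit them with curve_fit.
N = 1000
model = lambda perp, a, b, c: ((a * perp) / N) / (1 - c * (perp / N)**b)
perp = np.linspace(5, 900, 30)
y = model(perp, 2.0, 2.0, 0.9)
popt, _ = optimize.curve_fit(model, perp, y)
print(popt)
```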
Now let us figure out how Sigma is connected with the number of nearest neighbors, which is UMAP's analogue of Perplexity:
import numpy as np
import pandas as pd
from scipy import optimize
import matplotlib.pyplot as plt
from sklearn.manifold import SpectralEmbedding
from sklearn.metrics.pairwise import euclidean_distances
path = '/home/nikolay/WABI/K_Pietras/Manifold_Learning/'
expr = pd.read_csv(path + 'bartoschek_filtered_expr_rpkm.txt', sep='\t')
print(expr.iloc[0:4,0:4])
X_train = expr.values[:,0:(expr.shape[1]-1)]
X_train = np.log(X_train + 1)
n = X_train.shape[0]
print("\nThis data set contains " + str(n) + " samples")
y_train = expr.values[:,expr.shape[1]-1]
print("\nDimensions of the data set: ")
print(X_train.shape, y_train.shape)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import euclidean_distances
my_n_neighbor = []; my_sigma_umap = []
# Squared pairwise Euclidean distances do not depend on N_NEIGHBOR, so compute once
dist = np.square(euclidean_distances(X_train, X_train))
# rho = distance from each point to its nearest neighbor
# (index 0 of the sorted row is the point itself, at distance 0)
rho = [sorted(dist[i])[1] for i in range(dist.shape[0])]
for N_NEIGHBOR in range(4, 724, 10):
    def prob_high_dim(sigma, dist_row):
        """Row of UMAP high-dimensional probabilities: an exponential kernel
        exp(-(d - rho) / sigma), not normalized."""
        d = dist[dist_row] - rho[dist_row]
        d[d < 0] = 0
        return np.exp(-d / sigma)

    def k(prob):
        """Effective number of nearest neighbors: UMAP fixes sum(p) = log2(k)."""
        return np.power(2, np.sum(prob))

    def sigma_binary_search(k_of_sigma, fixed_k):
        """Binary search for the sigma that yields the fixed number of neighbors."""
        sigma_lower_limit = 0; sigma_upper_limit = 1000
        for i in range(20):
            approx_sigma = (sigma_lower_limit + sigma_upper_limit) / 2
            if k_of_sigma(approx_sigma) < fixed_k:
                sigma_lower_limit = approx_sigma
            else:
                sigma_upper_limit = approx_sigma
            if np.abs(fixed_k - k_of_sigma(approx_sigma)) <= 1e-5:
                break
        return approx_sigma

    prob = np.zeros((n, n)); sigma_array = []
    for dist_row in range(n):
        func = lambda sigma: k(prob_high_dim(sigma, dist_row))
        binary_search_result = sigma_binary_search(func, N_NEIGHBOR)
        prob[dist_row] = prob_high_dim(binary_search_result, dist_row)
        sigma_array.append(binary_search_result)
    print("N_neighbor = {0}, Mean Sigma = {1}".format(N_NEIGHBOR, np.mean(sigma_array)))
    my_n_neighbor.append(N_NEIGHBOR)
    my_sigma_umap.append(np.mean(sigma_array))
plt.figure(figsize=(20,15))
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.title("UMAP: Mean Sigma vs. N_neighbor", fontsize = 20)
plt.xlabel("N_NEIGHBOR", fontsize = 20)
plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
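A quick sanity check of the neighbor-count definition used above (a sketch with made-up numbers): UMAP fixes sum(p) = log2(k), so if exactly m neighbors receive probability 1 and all other points probability 0, the effective number of nearest neighbors is 2**m:

```python
import numpy as np

# k = 2**sum(p): with 4 fully connected neighbors (p = 1) and the rest at
# p = 0, the effective number of nearest neighbors is 2**4 = 16.
p = np.concatenate([np.ones(4), np.zeros(10)])
k = np.power(2, np.sum(p))
print(k)  # 16.0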
Let us compare how the Perplexity (tSNE) and N_neighbor (UMAP) hyperparameters behave on the same data set, where the Euclidean distances are fixed:
plt.figure(figsize=(20, 15))
plt.plot(my_perp, my_sigma_tSNE, '-o')
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.gca().legend(('tSNE','UMAP'), fontsize = 20)
plt.title("Sigma vs. Perplexity / N_Neighbors for tSNE / UMAP", fontsize = 20)
plt.xlabel("PERPLEXITY / N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.show()
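The rescaling applied next can be motivated as follows (a sketch under the conventions of this post, where dist holds squared Euclidean distances): tSNE's Gaussian kernel exp(-d / (2*sigma^2)) coincides exactly with UMAP's exponential kernel exp(-d / sigma) once 2*sigma_tSNE**2 is substituted for sigma_UMAP, ignoring UMAP's rho offset:

```python
import numpy as np

# With squared distances d (as in this post), tSNE's kernel exp(-d/(2*sigma^2))
# equals UMAP's kernel exp(-d/sigma_umap) when sigma_umap = 2*sigma**2.
d = 4.0
sigma_tsne = 1.5
sigma_umap = 2 * sigma_tsne**2

p_tsne = np.exp(-d / (2 * sigma_tsne**2))
p_umap = np.exp(-d / sigma_umap)
print(p_tsne, p_umap)  # identical by construction
```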
my_sigma_tSNE_mod = [2*(i**2) for i in my_sigma_tSNE]
plt.figure(figsize=(20, 15))
plt.plot(my_perp, my_sigma_tSNE_mod, '-o')
plt.plot(my_n_neighbor, my_sigma_umap, '-o')
plt.gca().legend(('tSNE','UMAP'), fontsize = 20)
plt.title("Sigma vs. Perplexity / N_Neighbors for tSNE / UMAP", fontsize = 20)
plt.xlabel("PERPLEXITY / N_NEIGHBOR", fontsize = 20); plt.ylabel("MEAN SIGMA", fontsize = 20)
plt.xlim(0,500); plt.ylim(0,600)
plt.show()
plt.figure(figsize=(20,15))
# Distribution of the UMAP high-dimensional probabilities
sns.histplot(prob.flatten(), kde=True)
plt.xlim(-0.1,0.5)
plt.show()